Executive Summary


Introduction 

The  ability  to  automatically  discover  relationships  contained  within  data,  quantify  their  strength, 
and present them graphically to the user for visualization is defined as “Relationship Discovery”.
This  capability  was  the  major  research  effort  during  Phase  I  of  this  SBIR  Project. 

Relationship  Discovery  is  the  first  step  in  HNC’s  proprietary  Data  Base  Mining  process.  It 
determines  “what  is  important”  in  the  data  and  estimates  the  strength  of  the  relationships  between 
the  variables.  This  detection  of  relationships  is  the  necessary  precursor  to  the  modeling  step  where 
the  detected  relationships  are  modeled. 

The approach uses a self-organizing neural network technique to approximate the probability
density function (PDF) of the data set. This approximation of the PDF is necessary on large data
sets to reduce the size of the problem and make it computationally tractable. The two-dimensional
projections of the PDF are then automatically examined for the existence of relationships and a
relationship strength value is assigned to the projection. The relationships with the highest strength
are presented either rapidly as a two-dimensional scatter plot or as a three-dimensional graph of the
PDF, with the Z dimension being the amplitude of the PDF at each point.

This  approach  is  similar  to  correlation  analysis,  but  provides  more  accurate  and  useful  results  since 
it  can  accommodate  nonlinear  as  well  as  linear  relationships.  Additionally,  this  approach  can  handle 
relationships that are discontinuous or exist only over a limited range of values. Unlike
conventional  statistical  approaches  based  on  correlation  analysis,  the  proposed  Relationship 
Discovery  technique  makes  no  assumptions  about  the  nature  of  any  functional  or  nonfunctional 
relationships  contained  within  the  data.  Instead,  the  capability  is  driven  by  the  underlying 
probability  density  of  the  data,  thereby  giving  a  true  representation  of  any  relationships. 


Objectives of Phase I

The  Phase  I  effort  for  this  SBIR  concentrated  on  the  development  of  the  automated  relationship 
discovery and relationship aggregation techniques. The objectives for Phase I are listed below:

•  Refine and expand the analysis techniques used for the determination of the existence of a
relationship between specific variables.

•  Automate the relationship determination techniques.

•  Quantify the “strength” of relationships and develop a mechanism to ignore relationships
that have less than an automatically determined cutoff strength.

•  Develop an automated process to aggregate the relationships and build relationship
hierarchies.

•  Develop proof-of-concept code that implements the above techniques.

•  Perform tests of the code on simulated and real data.

All of the Phase I technical objectives were met and alternate approaches to the relationship
discovery  process  were  developed.  Furthermore,  a  significant  amount  of  refinement  on  ... 
relationship discovery tool was performed. This resulted in a fully functional capability that is
currently in use at HNC. The Relationship Discovery software tool along with an executable copy
of the HNC Data Base Mining software will be delivered to the Army (AIRMICS) along with this
Final Report.

As  a  result  of  Phase  I,  the  relationship  discovery  capability  has  been  enhanced  to  automatically 
determine  the  existence  of  a  relationship  in  each  sub-space  and  to  determine  the  strength  of  the 
relationship.  This  capability  is  then  used  to  produce  a  comprehensive  listing  of  variables  that  are 
related  to  other  variables  and  the  strength  of  the  relationship.  Furthermore,  the  user  may  now  easily 
see  the  most  important  relationships  without  having  to  exhaustively  examine  all  relationships. 

The  relationship  discovery  tool  can  also  produce  an  aggregated  listing  of  minimum  paths  between 
selected  database  variables.  The  path  length  can  be  thought  of  as  being  the  inverse  of  relationship 
strength.  If  a  pair  of  variables  have  a  strong  relationship  strength,  then  they  have  a  “close” 
distance.  Weakly  related  variables  have  a  longer  relationship  distance.  The  Relationship  Discovery 
tool  can  compute  the  path  of  minimum  length  between  two  variables.  This  information  can  be  used 
to  assist  in  the  development  of  succinct  data  models  as  well  as  gaining  additional  insight  into  the 
variables  and  the  data. 

The  power  of  the  Relationship  Discovery  approach  is  that  it  makes  no  assumptions  about  the 
statistics  of  the  data  contained  within  the  database.  As  such,  this  approach  will  perform  equally 
well  on  data  with  gaussian  or  non-gaussian  statistics.  Additionally,  this  approach  makes  no 
assumptions  about  how  many  relationship  regions  exist  in  each  sub-space.  These  powerful 
characteristics  allow  the  relationship  discovery  concept  to  be  applied  to  a  wide  range  of  problems. 
The  output  of  Relationship  Discovery  is  a  three-dimensional  plot  of  the  PDF  for  each  of  the  two- 
dimensional  sub-space  projections  for  visual  analysis  and  a  list  of  the  variables  that  are 
“important”.


Necessary Enhancements Identified in Phase I

During  the  attainment  of  these  results  and  as  a  result  of  the  testing  of  these  concepts  on  data  sets, 
several  necessary  enhancements  to  the  overall  Data  Base  Mining  system  concept  were  identified. 
These are:

•  Missing  Value  Prediction:  This  capability  will  provide  a  way  of  systematically  replacing 
missing  values  in  the  data  set  with  maximum  likelihood  estimates  of  their  true  values. 

•  Bad  Data  Detection:  This  capability  will  help  provide  identification  of  potentially  “bad” 
records in the data set.

•  Data  Redundancy  Removal:  This  capability  is  necessary  to  optimize  the  model  explanation 
capability  and  eliminate  ambiguous  results. 

The  first  two  of  these,  Missing  Value  Prediction  and  Bad  Data  Detection,  together  comprise 
Automatic Data Cleaning.

These  enhancements  identified  during  the  Phase  I  effort  are  being  proposed  for  implementation 
during Phase II. The technical approaches for these enhancements as well as their implications are
detailed  in  the  “Extensions”  section  of  this  report. 
In summary, the following results were achieved during Phase I of this SBIR project:

•  Implementation  and  demonstration  of  an  automatic  approach  to  relationship  discovery  via 
2-D  subspace  projections. 

•  Development  of  a  Chi-squared  test  to  find  relationships. 

•  Implementation  of  the  capability  for  a  scatter-plot  display  for  quick  visualization  of 
relationships. 

•  Development  of  the  capability  to  perform  efficient  Parzen  windowing  on  the  approximated 
PDF  to  generate  3-D  surface  plots  of  the  PDF  projections. 

•  Development  of  the  capability  to  display  relationships  in  rank  space. 

•  Development  of  an  approach  to  the  model  reduction  problem  via  the  shortest  path  method 
using  the  Chi-squared  test  and  Cramer’s  coefficient  to  generate  the  Shortest  Path  Tree. 

•  Implementation  of  these  capabilities  in  software  and  verification  of  software  performance 
on  real  and  artificial  data. 

•  Enhancement  of  the  general  software  capabilities  of  the  Relationship  Discovery  tool: 

-  Increase  in  utilization  of  neurocomputer  processing  capabilities  to  reduce  processing 
time. 

-  Enhancement  of  usability  through  implementation  of  a  simple  graphical  user  interface. 

-  Integration  of  the  software  into  a  single  coalesced  environment. 

•  Conceptualization  of  approaches  to  three  necessary  enhancements  to  the  Data  Base  Mining 
Concept. 
Description of End-Use Problem


Analysis of the data contained in large databases can be a complex and time-consuming process.
HNC has developed KENN (Knowledge Extraction using Neural Networks), also known as Data
Base Mining, as a system level approach to this analysis problem. This approach is a new neural
network-based concept for extraction, characterization, and exploitation of knowledge contained
within  large  databases.  This  approach  utilizes  HNC  proprietary  techniques,  including  several 
patented  concepts. 

KENN  technology  is  in  existence  today  and  has  been  successfully  employed  on  a  prototype  basis 
for  the  analysis  of  several  diverse  types  of  data.  KENN  is  a  powerful  tool  for  the  analysis  of  data, 
particularly  when  relationships  between  variables  are  non-linear.  The  neural  network  approach 
employed  by  KENN  is  closely  related  to  advanced  statistical  techniques.  These  techniques  are 
especially useful when the parametric form of the relationships in the data is unknown. It has
been  shown  [1,2]  that  these  network  techniques  provide  a  rich  set  of  basis  functions  for  the 
solution  of  a  wide  class  of  problems.  It  has  further  been  proven  that  multi-layer  feed-forward 
networks  are  universal  approximators  for  arbitrary  functions  [3,4].  This  background  work  has 
demonstrated  that  neural  networks  are  a  valid  and  promising  vehicle  for  the  modeling  and  analysis 
of  database  relationships. 

In  addition  to  the  provable  robustness  of  the  neural  network  approach,  there  are  additional  benefits: 

•  The  level  of  expertise  required  to  use  this  approach  is,  in  general,  significantly  lower  than 
that  required  by  conventional  statistical  approaches. 

•  The  KENN  approach  will,  at  a  minimum,  produce  results  that  are  equivalent  to 
conventional  approaches. 

•  The  KENN  approach  is  capable  of  solving  non-linear  problems  due  to  its  use  of  neural 
network  techniques. 

•  The  cost  of  the  analysis  is  lower  than  a  conventional  approach  because  KENN  uses 
adaptive  techniques  which  rapidly  model  the  underlying  relationships  without  specification 
of  the  parametric  model  or  hypothesis  testing.  Neural  networks  “learn  by  example”,  and 
as  such,  no  programming  is  required. 

HNC has used the KENN approach for analysis of real data sets with outstanding results. This
development  effort  has  resulted  in  a  set  of  capabilities  that,  with  the  proposed  enhancements,  are 
directly  applicable  to  the  EIS/ESS  problem.  A  summary  of  the  characteristics  and  attributes  of 
HNC's approach is as follows:

•  Provides  a  new  way  of  analyzing,  interpreting  and  understanding  the  contents  of  databases. 

•  Can  be  combined  with  conventional  techniques  to  provide  significant  new  capabilities. 

•  Finds  and  determines  the  structure  of  relationships.  As  such,  KENN  can  discover 
unknown  relationships. 

•  Characterizes  relationships  in  both  a  qualitative  and  quantitative  fashion,  and  can 
characterize  large  databases  in  a  concise  format  with  controllable  precision. 
•  Allows visualization of relationships and provides outputs in both graphical and tabular
formats. 

•  Explains  relationships  by  identifying  discriminators  and  provides  a  ranking  of  their 
importance. 

•  No  prior  knowledge  of  relationships  within  the  data  is  required  and  domain  expertise  is  not 
required  for  successful  analysis. 

•  Requires  no  programming. 

•  HNC  currently  provides  this  capability  in  a  PC  and  Sun  environment.  Other  workstation 
platforms  can  also  be  utilized.  System  performance  is  enhanced  by  utilizing  a  neural 
network co-processor such as HNC’s commercial products, the ANZA Plus and Balboa.


Figure 1. KENN Functional Data Flow

The  KENN  concept  recognizes  that  the  analysis  is  best  accomplished  in  an  incremental  fashion. 
As  such,  KENN  consists  of  five  key  analysis  components.  These  components,  shown  in  Figure 
1,  are: 

•  Preprocessing 

•  Relationship  Discovery 

•  Model  Building 

•  Query Analysis and Explanation

•  Rule  Extraction 

The  ability  to  automatically  discover  relationships  contained  within  data,  quantify  their  strength, 
and present them graphically to the user for visualization is defined as “Relationship Discovery”.
This  capability  was  the  major  research  effort  during  Phase  I  of  this  SBIR  Project. 
This Phase I SBIR project covered the enhancement and automation of the Relationship Discovery
component. As seen in Figure 1, Relationship Discovery provides:

•  Qualitative information on the existence of relationships between variables in the input data.

•  Input  to  the  modeling  component  containing  the  information  necessary  to  build  smaller, 
more  efficient  models  using  subsets  of  the  input  variables. 

Relationship  Discovery  in  effect  determines  “what  is  important”  in  the  data  and  estimates  the 
strength  of  the  relationships  between  the  variables.  This  detection  of  relationships  is  the  necessary 
precursor  to  the  modeling  step  where  the  detected  relationships  are  modeled. 

The  approach  uses  a  self  organizing  neural  network  technique  to  approximate  the  probability 
density  function  (PDF)  of  the  data  set.  This  approximation  of  the  PDF  is  necessary  on  large  data 
sets  to  reduce  the  size  of  the  problem  and  make  it  computationally  tractable.  The  two-dimensional 
projections  of  the  PDF  are  then  automatically  examined  for  the  existence  of  relationships  and  a 
relationship  strength  value  is  assigned  to  the  projection.  The  relationships  with  the  highest  strength 
are  presented  either  rapidly  as  a  two-dimensional  scatter  plot  or  as  a  three-dimensional  graph  of  the 
PDF, with the Z dimension being the amplitude of the PDF at each point.

This  approach  is  similar  to  correlation  analysis,  but  provides  more  accurate  and  useful  results  since 
it  can  accommodate  nonlinear  as  well  as  linear  relationships.  Additionally,  this  approach  can  handle 
relationships  that  are  discontinuous  or  exist  only  over  a  limited  range  of  values.  Unlike 
conventional  statistical  approaches  based  on  correlation  analysis,  the  proposed  Relationship 
Discovery  technique  makes  no  assumptions  about  the  nature  of  any  functional  or  nonfunctional 
relationships  contained  within  the  data.  Instead,  the  capability  is  driven  by  the  underlying 
probability  density  of  the  data,  thereby  giving  a  true  representation  of  any  relationships. 

The  Phase  I  effort  for  this  SBIR  concentrated  on  the  development  of  the  automated  relationship 
discovery  and  relationship  aggregation  techniques.  The  objectives  that  drove  this  Phase  I  SBIR 
project  effort  are  listed  below: 

•  Refine  and  expand  the  analysis  techniques  used  for  the  determination  of  the  existence  of  a 
relationship  between  specific  variables. 

•  Automate  the  relationship  determination  techniques. 

•  Quantify  the  “strength”  of  relationships  and  develop  a  mechanism  to  ignore  relationships 
that  have  less  than  an  automatically  determined  cutoff  strength. 

•  Develop  an  automated  process  to  aggregate  the  relationships  and  build  relationship 
hierarchies. 

•  Develop  proof-of-concept  code  that  implements  the  above  techniques. 

•  Perform  tests  of  the  code  on  simulated  and  real  data. 
Introduction to Relationship Discovery


The  problem  of  Relationship  Discovery  can  be  phrased  as  follows:  “Given  a  database,  find 
subsets  of  the  fields  of  the  database  such  that  knowing  some  of  the  fields  in  the  subset  provides 
information  about  other  fields  in  the  subset.” 

The  assumption  is  that  if  A  and  B  are  related,  then  knowing  A  should  provide  some  information 
about  B.  This  relationship  between  A  and  B  might  then  be  exploited  in  some  way.  In  particular,  if 
one  was  building  a  model  using  A  and  B  in  the  input  set,  one  might  be  able  to  drop  one  of  A  and  B 
from  the  input  set  because  of  the  known  relationship  between  the  two. 

The  first  observation  to  make  about  the  problem  stated  above  is  that  given  a  set  of  K  variables, 
there are $2^K$ (two to the power of K) subsets of the K variables. Thus, it would be
computationally  infeasible  to  attempt  to  examine  each  of  the  subsets  and  determine  a  measure  of  the 
information content of the variables in the subset. Even if it were possible to do this, it would be
unwieldy, to say the least, to attempt to use this huge mass of information.

Thus,  we  rephrase  the  problem  in  a  slightly  different  way:  “Given  a  database,  determine  a  measure 
of  the  strength  of  the  relationship  between  pairs  of  its  fields.” 
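
To make the size difference concrete, the short sketch below counts the subsets in the original formulation against the pairs in the rephrased one. The function names and the choice of K = 30 are illustrative assumptions, not values taken from this report.

```python
from math import comb


def subset_count(k: int) -> int:
    """Number of subsets of k variables (the original problem statement)."""
    return 2 ** k


def pair_count(k: int) -> int:
    """Number of non-redundant pairs of k variables (the rephrased problem)."""
    return comb(k, 2)


if __name__ == "__main__":
    k = 30  # hypothetical number of database fields
    # For K = 30 this is 1,073,741,824 subsets but only 435 pairs.
    print(subset_count(k), pair_count(k))
```

The pair count grows as K(K-1)/2, which is why the pairwise formulation remains usable for databases with many fields.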

A  solution  to  this  problem  would  provide  a  solution  to  any  instance  of  the  first  problem  for  a 
specific  subset  of  the  variables.  Given  the  subset  one  was  concerned  about,  one  would  examine  all 
pairs  of  its  variables.  If  a  pair  of  variables  were  found  to  be  strongly  related,  it  would  be  known 
that  each  one  carries  information  about  the  other. 

If  we  could  solve  this  problem  of  quantifying  the  strength  of  relationships  between  the  variables, 
we  could  then  specify  a  cutoff  strength  below  which  relationships  would  be  deemed  insignificant. 
This  would  solve  the  problem  of  finding  all  relationships  among  the  variables  of  the  database. 

What  remains  is  to  provide  the  modeling  tool  a  way  of  aggregating  the  information  contained  in  the 
relationship  strengths  to  determine  a  reduced  model  to  predict  any  given  variable.  In  this  context 
we note that related variables exhibit a certain kind of transitivity. If A is related to B and B is
related  to  C,  then  A  is  related  to  C,  but  more  weakly.  This  observation,  phrased  in  terms  of  a 
distance  measure  that  is  inversely  related  to  Relationship  Strength,  forms  the  basis  of  the 
Relationship  Aggregation  Component.  Using  this  technique  we  build  a  reduced  model  by 
excluding  variables  that  are  linked  indirectly  to  the  variable  we  wish  the  model  to  predict. 

In  the  following  pages,  we  will  describe  approaches  to  reducing  the  problem  size  down  to  more 
manageable  levels.  We  will  then  describe  two  approaches  to  the  Relationship  Discovery  problem 
which  quantify  the  strength  of  the  relationships  between  variables  in  the  input  data.  Finally,  we 
will discuss the Relationship Aggregation component.

Throughout  this  report  we  will  use  the  terms  “field”  (as  in  a  database)  and  “variable”  (as  in  a  data 
set)  interchangeably.  We  will  also  use  the  terms  “record”  (as  in  a  database),  “observation”  (as  in  a 
data set), and “data point” (as in an n-dimensional space) interchangeably.

Decreasing the Size of the Problem


The  time  taken  by  the  Relationship  Discovery  algorithms  that  will  be  described  in  the  following 
sections  goes  up  rapidly  as  the  number  of  data  points  goes  up.  Often  one  will  encounter  problems 
in  which  there  are  more  data  points  than  one  can  process  in  a  reasonable  amount  of  time.  In  these 
cases, it is important to decrease the number of data points that will actually be used by the
Relationship  Discovery  algorithms.  There  are  several  approaches  to  this. 

The  simplest  approach  is  simply  to  sample  the  data.  One  can  easily  select  a  random  subset  of  the 
observations  that  is  as  small  as  necessary.  However,  as  the  sample  size  grows  smaller,  the 
accuracy  of  the  results  gets  worse.  For  our  purposes,  the  accuracy  would  be  unacceptably  low. 

One can do better if one chooses the sample with a little more effort. The k-means algorithm is an
algorithm which generates representative data points that are cluster centers in the input data. This
accurately positions the points in the K-dimensional space, but does not accurately capture the
relative  probability  density  at  different  regions  in  the  space.  This  is  because  clusters  of  many 
different  sizes  are  all  represented  by  a  single  cluster  center  each. 

The  Kohonen  algorithm  [5]  with  HNC’s  proprietary  conscience  mechanism  provides  a  way  of 
obtaining  cluster  centers  that  correctly  represent  the  probability  density  function  in  K-space.  This 
algorithm  generates  a  set  of  points  in  the  K-dimensional  space  spanned  by  the  input  such  that  each 
data  point  represents  a  specified  fraction  of  the  input  data  and  is  located  close  to  the  centroid  of  the 
data points it represents. This performs the kind of data compression needed by the Relationship
Discovery  algorithms. 
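
HNC's conscience mechanism is proprietary and is not specified in this report, so the sketch below is only a generic illustration of the idea: competitive learning in which the choice of winner is biased against cluster centers that have been winning too often, so that the centers tend to share the data more evenly. The bias rule follows the general form of published conscience schemes. Every name and parameter here (`conscience_cluster_centers`, `bias_gain`, `freq_rate`, the learning rate) is an assumption made for illustration, and the Kohonen neighborhood update is omitted for brevity.

```python
import random


def conscience_cluster_centers(data, n_centers, epochs=20, lr=0.1,
                               bias_gain=10.0, freq_rate=0.01, seed=0):
    """Competitive learning with a generic 'conscience' bias (illustrative only).

    data      : list of equal-length numeric tuples
    n_centers : number of cluster centers to produce
    Returns a list of cluster centers (lists of floats).
    """
    rng = random.Random(seed)
    # Start each center on a randomly chosen data point.
    centers = [list(p) for p in rng.sample(data, n_centers)]
    # Running estimate of how often each center wins.
    win_freq = [1.0 / n_centers] * n_centers
    target = 1.0 / n_centers

    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            x = data[idx]

            def biased_distance(j):
                d2 = sum((a - c) ** 2 for a, c in zip(x, centers[j]))
                # Centers that win less often than their fair share get a bonus.
                return d2 - bias_gain * (target - win_freq[j])

            winner = min(range(n_centers), key=biased_distance)
            # Update the win-frequency estimates.
            for j in range(n_centers):
                hit = 1.0 if j == winner else 0.0
                win_freq[j] += freq_rate * (hit - win_freq[j])
            # Move the winning center a fraction of the way toward the data point.
            centers[winner] = [c + lr * (a - c) for a, c in zip(x, centers[winner])]
    return centers


if __name__ == "__main__":
    grid = [(i / 9.0, j / 9.0) for i in range(10) for j in range(10)]
    for center in conscience_cluster_centers(grid, n_centers=5):
        print(center)
```

Because each update moves a center toward a data point by a fraction between 0 and 1, the centers always stay inside the region spanned by the data. How evenly the data is shared among the centers depends on the assumed gain and rate parameters.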

Figures 2 and 3 show an example of the data compression provided by the Kohonen algorithm.
Figure  2  shows  a  set  of  input  data  points  in  a  two  dimensional  space.  Each  input  data  point  is 
represented  by  the  symbol  “x”.  X  and  Y  are  the  variables  spanning  this  space.  In  Figure  3,  the 
same  input  data  points  are  shown  along  with  the  cluster  centers  computed  by  the  Kohonen 
algorithm  with  the  conscience  mechanism.  Each  cluster  center  is  represented  by  a  large  dot 
connected  by  straight  lines  to  the  input  data  points  it  represents.  It  can  be  seen  that  the  cluster 
centers  are  at  the  centroids  of  the  clusters,  and  that  each  cluster  center  represents  the  same  fraction 
(20%)  of  the  input  data  points. 

An  interesting  approach  to  decreasing  the  size  of  the  Relationship  Discovery  problem  involves  the 
use  of  a  hierarchical  cluster  tree.  This  is  a  tree  of  clusters  in  the  input  data  having  the  following 
structure.  At  the  top  is  the  root,  a  single  cluster  containing  all  the  input  data  points.  The  next  level 
of  the  tree  (children  of  the  root)  consists  of  a  small  number  of  super-clusters  such  that  each  data 
point belongs to exactly one super-cluster. At the next level (grandchildren of the root) there are a
small  number  of  clusters  parented  by  each  super-cluster.  This  hierarchical  structure  continues  until 
at  the  bottom  level,  each  data  point  is  its  own  sub-sub-...-sub-cluster.  This  is  illustrated  in  Figure 
4. 
Figure 3: Data Compression using Kohonen Network


This kind of cluster tree could be built using the Kohonen algorithm with a form of node growth.
Once  it  is  built,  the  idea  is  to  run  Relationship  Discovery  on  the  super-clusters,  clusters,  sub- 
clusters,  and  so  on  until  the  relationships  don’t  change  much  upon  moving  to  the  next  level.  This 
could potentially increase the efficiency of Relationship Discovery significantly.

Thus  if  very  large  problems  are  encountered,  it  is  possible  to  decrease  their  size  to  more  tractable 
levels.  The  Kohonen  algorithm  has  been  implemented  and  tested.  The  implementation  of  this 
hierarchical  cluster  tree  concept  was  not  within  the  scope  of  Phase  I.  It  was  also  not  proposed  for 
implementation for Phase II, as it is believed that the Kohonen approach to data compression
would  suffice  for  most  real  problems. 
Figure 4: A Hierarchical Cluster Tree (bottom level: input data points)
Approaches to Relationship Discovery


Relationship  Discovery  Via  the  Chi-squared  Test 

Relationship  Discovery  using  the  Chi-squared  test  is  performed  as  follows:  Compress  the  data  and 
form  the  projection  of  the  compressed  data  set  onto  a  pair  of  variables,  say  A  and  B.  This  gives  us 
N  ordered  pairs  of  data  points.  Map  these  points  into  rank  space.  Bin  the  N  ordered  pairs  into  a 
small  constant  number  of  bins  (a  5  by  5  grid  of  25  bins  is  usually  good  enough).  Compute  the 
expected  number  of  points  in  each  bin,  and  the  actual  number  of  observations  that  fell  into  the  bin. 
The  Chi-squared  statistic  is  a  function  of  the  number  of  data  points  that  fell  into  each  bin  and  the 
number  that  were  expected  to  fall  into  each  bin  assuming  that  A  and  B  were  statistically 
independent.  If  the  Chi-squared  statistic  is  above  a  cutoff  threshold  for  a  given  confidence  level, 
there is a relationship. Do all this for all $O(K^2)$ non-redundant pairs of variables to determine all
bivariate  relationships  among  the  variables. 

Rank  space  for  a  particular  variable  is  a  space  in  which  each  value  of  the  variable  has  been 
replaced  by  its  rank  in  the  set  of  all  values  of  that  variable.  Thus,  in  rank  space,  the  variable  is 
uniformly  (i.e.,  evenly)  distributed. 
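
As a small illustration of the rank transformation just described, the sketch below replaces each value by its zero-based rank. The function name and the sample values are hypothetical, and breaking ties by original position is an assumption, since the report does not say how ties are handled.

```python
def to_rank_space(values):
    """Replace each value by its rank (0 = smallest) among all values.

    Ties are broken by original position (an assumption for this sketch).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


if __name__ == "__main__":
    skewed = [12.0, 250.0, 13.5, 12.5, 90.0]  # hypothetical, unevenly spread values
    print(to_rank_space(skewed))  # prints [0, 4, 2, 1, 3]
```

However unevenly the original values are spread, the ranks 0 through N-1 each occur exactly once, which is what makes each variable evenly distributed in rank space.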

The  above  approach  to  Relationship  Discovery  is  based  on  the  observation  that  if  two  variables 
have  been  transformed  to  rank  space,  they  will  be  evenly  distributed  when  considered  individually, 
but  they  may  or  may  not  be  evenly  distributed  when  taken  together.  If  they  are  evenly  distributed 
when taken together (for example, when seen in a scatter plot), they are statistically independent.
If they are not evenly distributed when taken together, we will say they are related.

Figure  5  shows  a  pair  of  variables,  X  and  Y,  in  the  original  space  and  after  being  transformed  to 
rank  space.  Note  that  once  they  are  transformed  to  rank  space,  they  are  evenly  distributed 
individually,  but  not  when  taken  together.  Consequently,  X  and  Y  are  related. 


Figure 5: Transformation to Rank Space (panels: in original space; in rank space)
An illustration of the Chi-squared test is provided in Figure 6. Here the variables have been
transformed to rank space and segmented into bins. The bins do not all contain roughly the same
number of data points: some areas are too dense and others too sparse. Because of
the extent of the variation in the density of data points, the Chi-squared test deems there to be a
relationship.

This  approach  is  based  on  a  strong  theoretical  foundation,  since  the  Chi-squared  test  is  a  well- 
studied  statistical  test.  It  finds  nonlinear  as  well  as  linear  relationships.  To  find  all  bivariate 
relationships, this algorithm takes $O(K^2 N)$ time. This time complexity makes the data
compression  operation  crucial.  Reducing  N  via  data  compression  makes  this  algorithm  feasible  to 
execute. 
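
The bivariate procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the delivered software. The function names are hypothetical, ties in rank are broken by original position, the 95% confidence level is an arbitrary choice, and the cutoff of 26.296 is the standard Chi-squared table value for 16 degrees of freedom (a 5 by 5 grid) at that level. Cramer's coefficient is computed from the Chi-squared statistic as one possible normalized strength.

```python
import math

CHI2_CUTOFF_95_DF16 = 26.296  # table value: 16 degrees of freedom, 95% level


def to_ranks(values):
    """Zero-based ranks; ties broken by original position (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def chi_squared_strength(a, b, bins=5):
    """Return (chi-squared statistic, Cramer's coefficient) for two variables."""
    n = len(a)
    rank_a, rank_b = to_ranks(a), to_ranks(b)
    # Bin the rank pairs into a bins-by-bins grid of equal-width rank intervals.
    counts = [[0] * bins for _ in range(bins)]
    for ra, rb in zip(rank_a, rank_b):
        counts[ra * bins // n][rb * bins // n] += 1
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(counts[i][j] for i in range(bins)) for j in range(bins)]
    chi2 = 0.0
    for i in range(bins):
        for j in range(bins):
            # Count expected in this bin if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (counts[i][j] - expected) ** 2 / expected
    cramers_v = math.sqrt(chi2 / (n * (bins - 1)))
    return chi2, cramers_v


def discover_pairs(columns, cutoff=CHI2_CUTOFF_95_DF16):
    """Test every non-redundant pair of variables; keep those above the cutoff."""
    names = sorted(columns)
    found = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            chi2, strength = chi_squared_strength(columns[first], columns[second])
            if chi2 > cutoff:
                found[(first, second)] = strength
    return found


if __name__ == "__main__":
    a = [i // 25 for i in range(625)]
    data = {"a": a, "b": [(x - 12) ** 2 for x in a], "c": [i % 25 for i in range(625)]}
    print(discover_pairs(data))
```

In the usage example, `b` is a nonlinear (parabolic) function of `a`, while `c` is laid out on a grid so that it is evenly spread against both. Only the pair (`a`, `b`) exceeds the cutoff. The double loop over pairs with a pass over all N points inside it is where the time complexity discussed above comes from.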

This  approach  could  be  extended  to  find  all  relationships  of  third  or  higher  order.  It  would  be 
necessary  to  examine  all  distinct  three-element  subsets  of  the  set  of  variables,  except  those 
containing two-element subsets for which a relationship was already found. In the worst case,
the complexity would be $O(K^3 N)$. To find all m-variate relationships, the algorithm would take
$O(K^m N)$ time.

It is to be noted that the highest meaningful value of m is $O(\log N)$. For m greater than $O(\log N)$
there are just not enough data points to justify a claim that there is an m-variate relationship in the
data.

This  approach  can  be  used  for  finding  all  low-order  relationships  in  the  data.  This  is  reasonable 
since  the  relationships  that  are  of  greatest  interest  are  of  low  order.  Furthermore,  low  order 
relationships  have  more  impact  on  the  user  since  they  can  be  displayed  graphically  for  visualization 
by  the  user.  Finally,  almost  all  higher  order  relationships  among  finite  statistical  distributions  have 
detectable  lower  order  projections.  Thus,  if  one  detected  all  lower  order  relationships,  one  could 
infer  the  existence  of  almost  all  higher  order  relationships  that  would  occur  in  practice  with  a 
Relationship  Aggregation  algorithm. 


Figure  6:  Relationship  Discovery  using  the  Chi-squared  Test 
Relationship Discovery Via Probing of K-space

The  Probing  of  K-space  approach  to  Relationship  Discovery  is  performed  as  follows: 

Transform  the  entire  compressed  data  set  (all  variables)  to  a  K-dimensional  rank  space.  Thus  each 
variable  taken  by  itself  has  a  uniform  distribution.  Take  one  of  the  input  vectors,  X,  and  perturb 
one  of  its  components  (say  the  i'th  component)  to  give  X’.  This  is  clearly  illustrated  in  Figure  7. 
Find  the  nearest  input  vector  to  X'  in  a  Euclidean  K-space  that  has  been  stretched  in  the  direction  of 
the i'th component. Let this nearest vector be Y. Call (Y-X') the vector D.

The  larger  D  is,  the  more  sensitive  the  distribution  is  to  the  i'th  component  (the  component  we 
initially  perturbed).  Most  importantly,  the  components  of  D  represent  the  sensitivity  of  all  other 
dimensions  to  changes  in  the  i'th  dimension. 


Figure  7:  Probing  of  K-space 


We  are  most  interested  in  components  of  D  that  are  consistently  large  in  magnitude.  So  we 
compute  this  D  vector  using  many  different  input  vectors  in  turn  as  starting  points.  We  end  up 
with  a  distribution  of  the  components  of  D,  upon  which  we  use  the  sum-of-squares  statistic  to 
measure  whether  a  particular  component  is  consistently  large.  A  strong  relationship  is  indicated  by 
a  high  value  of  the  sum-of-squares  statistic. 

We accumulate the sum-of-squares statistic for perturbations of each dimension, giving a square
matrix that looks like a correlation matrix, but is different because it carries information about
nonlinear relationships as well as linear ones.
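The probing procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RELDISC implementation: the function names, the fixed perturbation step of 0.3 in rank units, the scaling of ranks to [0, 1], and the brute-force nearest-neighbor search are all choices made for the example.

```python
import math
import random


def rank_transform(rows):
    """Map each column to ranks scaled into [0, 1], so each variable alone is uniform."""
    n, k = len(rows), len(rows[0])
    ranked = [[0.0] * k for _ in range(n)]
    for j in range(k):
        order = sorted(range(n), key=lambda r: rows[r][j])
        for rank, r in enumerate(order):
            ranked[r][j] = rank / (n - 1)
    return ranked


def probe_strengths(rows, stretch, n_probes, step=0.3, seed=0):
    """Return a k-by-k matrix; entry [i][j] sums D[j]**2 over probes of dimension i."""
    rng = random.Random(seed)
    data = rank_transform(rows)
    n, k = len(data), len(data[0])
    strengths = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for _ in range(n_probes):
            probe = list(data[rng.randrange(n)])  # X
            # Perturb component i (assumed fixed step), keeping X' inside [0, 1].
            probe[i] += step if probe[i] + step <= 1.0 else -step
            best, best_dist = 0, math.inf
            for r in range(n):
                dist = 0.0
                for j in range(k):
                    # Stretching dimension i makes the neighbor match X' there first.
                    scale = stretch if j == i else 1.0
                    dist += (scale * (data[r][j] - probe[j])) ** 2
                if dist < best_dist:
                    best, best_dist = r, dist
            # D = Y - X'; accumulate the squares of its other components.
            for j in range(k):
                if j != i:
                    strengths[i][j] += (data[best][j] - probe[j]) ** 2
    return strengths
```

In this sketch the diagonal of the matrix is left at zero, and a row i is read as "how strongly each other variable responds when variable i is moved".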

This approach generates a square matrix of pairwise relationship strengths that contains information
about both bivariate and higher order relationships. As was previously noted, the highest order
relationship it can detect is O(log N), since there are not enough data points to justify a claim for
the existence of relationships of higher order than O(log N).

This approach is a heuristic, and is similar in flavor to Sensitivity Analysis. The parameters that must
be chosen well for this approach to work are the stretch factor and the number of input data
points to be perturbed for each dimension. Currently, we are obtaining good results with both the
stretch factor and the number of input data points perturbed set to the square root of N. There is,
however, no theoretical justification for this choice.


Relationship Aggregation


Both  algorithms  for  Relationship  Discovery  described  earlier  provide  quantitative  measures  of  the 
strength  of  the  relationships  between  pairs  of  the  input  variables.  These  strengths  can  be 
considered  on  an  abstract  level  to  be  the  weights  of  a  weighted,  undirected  graph.  Each  variable  is 
represented  by  a  vertex  and  each  relationship  by  an  edge  of  this  graph.  Strongly  related  variables 
are  linked  by  short  edges  and  weakly  related  variables  are  linked  by  long  edges.  Totally  unrelated 
variables  are  linked  by  infinitely  long  edges  (or  equivalently,  nonexistent  edges). 

In this Relationship Graph, it is of interest to know the shortest path from one variable to another.
That is, given two variables, through which other variables are they most strongly related? This
kind of query will be addressed by the Relationship Aggregation component.


Figure 8b: The Shortest Path Tree from Figure 8a Shown Separately (root node X1; figure annotation: "A good set of variables to predict X1")

For example, one might have a Relationship Graph similar to the one shown in Figure 8a. In this
figure, the nodes represent the variables in the data set, X1 through X5. The lines (both solid and
broken) joining the nodes represent relationships among the variables. If two nodes are not
connected then they are not related. The solid lines represent the relationships that are on the
shortest path tree rooted at the variable X1.

In Figure 8b, the same shortest path tree is shown in the format of a tree. The node labeled X1 is at
the root of the tree, and represents the variable X1.

The  information  contained  in  this  tree  representation  is  useful  because  it  provides  to  the  user 
structural  information  about  the  variables  in  the  model. 

Suppose the user was interested in indirect relationships between the variables X1 and X3. Upon
querying the Relationship Aggregation system, the result would be: “The strongest relationship
path from X1 to X3 is (X1,X2,X3).”
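A query of this kind can be sketched with Dijkstra's shortest path algorithm. The report does not state how a strength is converted into an edge length, so the sketch assumes length = 1 / strength (strong relationships give short edges, zero strength gives no edge). The function names and the strength values in the example graph are hypothetical.

```python
import heapq
import math


def shortest_path_tree(strength, root):
    """Dijkstra from `root`; returns (distance, parent) dictionaries."""
    dist = {v: math.inf for v in strength}
    parent = {root: None}
    dist[root] = 0.0
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, s in strength[u].items():
            if s <= 0.0:
                continue  # unrelated variables: infinitely long edge
            nd = d + 1.0 / s  # assumed strength-to-length conversion
            if nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent


def path_from_root(parent, target):
    """Read the linkage path root -> target off the tree; empty if unreachable."""
    if target not in parent:
        return []
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


# Hypothetical symmetric strengths for five variables.
example = {
    "X1": {"X2": 0.9, "X3": 0.2, "X4": 0.5},
    "X2": {"X1": 0.9, "X3": 0.8},
    "X3": {"X1": 0.2, "X2": 0.8, "X5": 0.4},
    "X4": {"X1": 0.5},
    "X5": {"X3": 0.4},
}
dist, parent = shortest_path_tree(example, "X1")
print(path_from_root(parent, "X3"))
```

With these example strengths the weak direct link between X1 and X3 is longer than the route through X2, so the reported path runs through X2.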

There  are  other  methods  to  aggregate  the  information  contained  in  the  output  of  Relationship 
Discovery  and  condense  it  to  present  to  the  user.  This  approach  has  the  advantage  of  having  a  clear 
visual  interpretation  as  the  variables  are  placed  in  a  hierarchical  tree  structure.  This  allows  the  user 
to  observe  the  strong  linkages  among  the  variables  of  the  data  set,  organize  them,  and  understand 
why variables are needed or not needed in the model.


Integrated Relationship Discovery Tool


The Integrated Relationship Discovery tool was built by merging the two forms of Relationship
Discovery and Relationship Aggregation into a single software tool, together with the
data compression facility necessary to make the tool run efficiently.

This software tool is called “RELDISC” and works in tandem with a data preprocessing tool called
SCALER. SCALER reads in a data file in ASCII format, scales the data to a range specified by the
user, and writes the data to a binary file. RELDISC reads in this binary file, trains a Kohonen
network, runs either or both Relationship Discovery algorithms, and performs Relationship
Aggregation.

RELDISC  allows  the  user  to  characterize  the  data  set  using  self  organization.  This  results  in  a  fully 
trained  Kohonen  weight  file  representing  the  probability  densities  contained  within  the  input 
database.  This  network,  then,  contains  the  “essence"  of  the  database.  Once  the  database  has  been 
characterized,  RELDISC  automatically  searches  for  relationships  and  displays  strong  relationships. 
It  then  accepts  user  input  to  save  particularly  interesting  PDF  surfaces  as  data  files  that  serve  as 
input  for  off-line  3D  plot  programs.  If  desired,  the  user  can  then  view  selected  relationships, 
perform  relationship  aggregation  and  generate  minimum  distance  graphs. 

RELDISC  starts  by  training  a  Kohonen  network  to  provide  a  compressed  representation  of  the 
Probability  Density  Function  of  the  input  data.  As  this  Kohonen  net  trains,  its  weights  converge 
on  points  in  the  multi-dimensional  input  space  that  are  the  centers  of  equiprobable  clusters.  Thus, 
at  the  end  of  training,  the  weights  of  the  Kohonen  network  provide  a  characterization  of  the  entire 
Probability  Density  Function  through  expansion  of  the  equiprobable  cluster  centers  into  Gaussians 
using  the  Parzen  windowing  process. 
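As a rough illustration of the Parzen windowing step, the sketch below places one isotropic Gaussian on each cluster center and averages them. Equal weights are used because the clusters are equiprobable. The window width `sigma` and the function name are assumptions for the example; the report does not specify how the width is chosen.

```python
import math


def parzen_pdf(centers, sigma):
    """Return a PDF estimate built as an equal-weight mixture of isotropic Gaussians."""
    k = len(centers[0])
    # Normalizing constant of one k-dimensional Gaussian, times the number of centers.
    norm = (2.0 * math.pi * sigma ** 2) ** (k / 2.0) * len(centers)

    def pdf(x):
        total = 0.0
        for c in centers:
            sq = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            total += math.exp(-sq / (2.0 * sigma ** 2))
        return total / norm

    return pdf
```

Evaluating the returned function on a grid over two chosen variables gives the kind of PDF surface that can be saved for off-line 3D plotting.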

RELDISC  allows  the  user  to  select  one  or  both  of  the  two  Relationship  Discovery  algorithms 
described  previously  in  this  report.  These  algorithms  have  been  implemented  in  C  to  run  as  a 
“stubnet”  on  the  ANZA-Plus  board. 

It is to be noted that the Relationship Discovery algorithms are sped up significantly by the data
compression provided by the Kohonen network. The amount of processing that would be required
if the data were not compressed in this manner would render the task infeasible for data sets of
real-life size.

RELDISC  feeds  the  output  of  Relationship  Discovery  to  a  viewing  module  that  automatically 
displays  as  many  of  the  strongest  relationships  as  the  user  wishes  to  see.  This  can  also  be  run  in 
manual  mode  where  the  user  requests  to  see  a  particular  relationship. 

RELDISC takes the results of Relationship Discovery and runs the Relationship Aggregation
algorithm on them. This generates a “Shortest Path Tree” rooted at a user-specified variable. The
user typically chooses to build the Shortest Path Tree from the variable which is to be the output of
a backpropagation model. The tree contains information about the distances of other variables
from the chosen variable, as well as information about variables that contain redundant
information.

Once  Relationship  Aggregation  is  run,  RELDISC  generates  a  file  containing  linkage  paths  from 
user-chosen  variables  to  the  output  variable.  These  paths  are  completely  contained  within  the 
Shortest Path Tree, and consequently are Shortest Paths themselves.


Testing


Tests were performed on both approaches to Relationship Discovery. The test data consisted of
artificially generated data from a linear congruential pseudorandom number generator. The data
contained related and unrelated variables. Varying amounts of noise were added to the data, and
the performance of Relationship Discovery was observed as a function of signal-to-noise ratio.

The  following  list  contains  the  generating  equations  for  a  sample  data  set  similar  to  those  used  for 
testing. 


A  =  noise 
B  =  A  +  noise 
C  =  A*A  +  noise 
D  =  cos(A)  +  noise 
E  =  exp(-A*A)  +  noise 
F  =  noise 


The noise can be assumed to have a Gaussian distribution with mean 0 and variance 1.

The test results were as expected. As the amount of noise increased, the strength of the
relationships detected (which represents the algorithm's confidence in its results) deteriorated.
However, it is to be noted that the algorithms were able to tolerate rather large amounts of noise
before they erroneously ranked a non-relationship above a relationship.

Specifically, the PDF-above-Threshold algorithm performed correctly with a signal-to-noise ratio
as low as 1.5:1, while the Probing-of-N-Space algorithm performed correctly with a signal-to-noise
ratio as low as 2:1. (The performance was defined as correct if all the relationships had
strengths that were greater than any of the non-relationships. In the above example, that would
mean that all the relationships involving variable F would have lower strength than any of the
others.)
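The correctness criterion stated above is simple to express in code. The following is a minimal sketch; the function name, the use of frozensets to identify variable pairs, and the strength values in the example are all hypothetical.

```python
def ranks_correctly(strength, true_pairs):
    """True if every real relationship is stronger than every non-relationship.

    strength maps frozenset({a, b}) to a measured strength;
    true_pairs is the set of pairs that are related by construction.
    """
    related = [s for pair, s in strength.items() if pair in true_pairs]
    unrelated = [s for pair, s in strength.items() if pair not in true_pairs]
    if not related or not unrelated:
        return True  # nothing to compare against
    return min(related) > max(unrelated)


# Hypothetical strengths for three variables; F is pure noise.
measured = {
    frozenset({"A", "B"}): 0.80,
    frozenset({"A", "F"}): 0.10,
    frozenset({"B", "F"}): 0.15,
}
print(ranks_correctly(measured, {frozenset({"A", "B"})}))
```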


Tests on Chi-squared Relationship Discovery

Test  data  set  1  was  created  using  the  equations  shown  in  Table  1.  The  constant  SN  can  be 
interpreted  (in  a  loose  way)  as  a  signal-to-noise  ratio.  As  SN  becomes  smaller,  the  amount  of 
noise  being  added  to  the  true  functional  relationship  increases,  as  is  evident  in  the  equations.  Test 
data  set  2  is  described  by  the  equations  in  Table  2.  The  constant  SN  has  the  same  interpretation. 
In  both  series  of  tests,  the  number  of  data  points  used  was  1000. 


X = Gaussian random
Linear = 2*X + noise/SN
Quadratic = X*X + noise/SN
Cosine = cos(X) + noise/SN
Random = Gaussian random


Table 1: Equations used to generate test data set 1


X1 = Gaussian random
X2 = Gaussian random
Linear = X1 + X2 + noise/SN
Quadratic = X1*X1 + X2 + noise/SN
Random = Gaussian random

Table 2: Equations used to generate test data set 2
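For readers who wish to reproduce data of this form, the equations in Tables 1 and 2 can be written as the following sketch. It assumes that both "Gaussian random" and "noise" denote independent draws with mean 0 and variance 1, and it uses Python's built-in generator rather than the linear congruential generator used in the original tests. The function names are hypothetical.

```python
import math
import random


def make_test_set_1(n, sn, seed=0):
    """Rows following Table 1; each noise term is an independent N(0, 1) draw."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        rows.append({
            "X": x,
            "Linear": 2.0 * x + rng.gauss(0.0, 1.0) / sn,
            "Quadratic": x * x + rng.gauss(0.0, 1.0) / sn,
            "Cosine": math.cos(x) + rng.gauss(0.0, 1.0) / sn,
            "Random": rng.gauss(0.0, 1.0),
        })
    return rows


def make_test_set_2(n, sn, seed=0):
    """Rows following Table 2, where the relationships involve two inputs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        rows.append({
            "X1": x1,
            "X2": x2,
            "Linear": x1 + x2 + rng.gauss(0.0, 1.0) / sn,
            "Quadratic": x1 * x1 + x2 + rng.gauss(0.0, 1.0) / sn,
            "Random": rng.gauss(0.0, 1.0),
        })
    return rows
```

Calling `make_test_set_1(1000, sn)` for a decreasing series of `sn` values gives a series of data sets with increasing noise, matching the 1000 data points used in both series of tests.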


The  results  of  Chi-squared  Relationship  Discovery  on  Test  Data  Set  1  are  shown  in  Appendix  1. 
The  results  on  Test  Data  Set  2  are  shown  in  Appendix  2. 

From  examining  the  output  of  the  tests,  it  can  be  seen  that  for  low  amounts  of  noise,  the 
performance  was  100%  successful.  However,  as  the  amount  of  noise  added  to  the  functions 
increased,  the  performance  of  the  algorithm  deteriorated.  It  can  be  seen  that  a  large  amount  of 
noise  was  added  before  the  performance  deteriorated  significantly. 

An important point to note is that in Test Data Set 2, the relationships were multivariate (higher
order) and not just bivariate. However, the algorithm was able to detect the relationships between
pairs of variables related through a higher order relationship. For example, the variables “X1” and
“Linear” were found to be strongly related, as were “X2” and “Linear”. It can be seen that this
higher order relationship was detectable because of its projections. Putting the two relationships
together, a Relationship Aggregation algorithm would be able to conclude that X1 and X2 were
related, though less strongly. Thus the higher order relationship would be detected along with the
bivariate (second order) relationships.


Tests on Probing-of-K-space Relationship Discovery

The results of Probing of K-space Relationship Discovery on Test Data Set 1 are shown in
Appendix 3. The results on Test Data Set 2 are shown in Appendix 4.

From examining the output of the tests, it can be seen that for low amounts of noise, the
performance was 100% successful from the point of view that all non-existent relationships were
ranked below all relationships that existed. As the amount of noise added to the functions
increased, the performance of the algorithm deteriorated. It can be seen that a large amount of
noise was added before the performance deteriorated significantly. However, the performance was
not as good as that of the Chi-squared test algorithm.


Tests using Civilian Customer Data

The Chi-squared Relationship Discovery algorithm was given further testing on a data set obtained
from a civilian customer of HNC. This data set contained data on cellular phone usage and billing.
The list of the relationships detected is given in Appendix 5. The relationships that were detected
made intuitive sense, and upon looking at the data it could be seen that the relationships were
actually present.


Extensions


Necessary  Enhancements  Identified  in  Phase  I 

During  the  attainment  of  these  results  and  as  a  result  of  the  testing  of  these  concepts  on  data  sets, 
several  necessary  enhancements  to  the  overall  Data  Base  Mining  system  concept  were  identified. 
These are:

•  Missing  Value  Prediction:  This  capability  will  provide  a  way  of  systematically  replacing 
missing  values  in  the  data  set  with  maximum  likelihood  estimates  of  their  true  values. 

• Bad Data Detection: This capability will help provide identification of potentially “bad”
records in the data set.

•  Data  Redundancy  Removal:  This  capability  is  necessary  to  optimize  the  model  explanation 
capability  and  eliminate  ambiguous  results. 

The first two of these, Missing Value Prediction and Bad Data Detection, together comprise
Automatic Data Cleaning.

These enhancements identified during the Phase I effort are being proposed for implementation
during Phase II. The technical approaches for these enhancements as well as their implications are
detailed in this section.


Missing  Value  Prediction 

Motivation  for  Missing  Value  Prediction 

The problem of Missing Value Prediction is of significant practical importance in the analysis
of data sets obtained from real-life databases. In typical databases, many records
contain at least one missing value. These missing values can significantly impact the
perceived statistics of the database and adversely affect the resulting analysis. In building
models, it is important to have good, clean data. The model can be no better than the data that is
used to build it. Missing values in the data set often constitute a significant problem in
obtaining a good data set to build a high-quality model.

A formal definition of the missing value prediction problem can be stated as: “Given a database
in which some of the variables for each of the observations are missing, determine good
estimates for the missing values that are consistent with the other information in the same
observation and in the rest of the data set.”

Concept  of  Identity  Map  Network 

To perform Missing Value Prediction, the Identity Map Network is used. The Identity Map
Network is a Multi-layer Back-Propagation Network that is trained to reconstruct the input data
at the output layer. As seen in Figure 9, the Identity Map Network has as many output nodes as
it has inputs. However, the hidden layer has fewer nodes than either the input or the output
layers. The concept of the identity map network was first described in [9].


Figure 9: Identity Map Network (showing the small hidden layer)


For  simplicity  of  explanation,  the  solution  of  the  problem  is  described  using  Identity  Map 
Networks  with  only  one  hidden  layer.  However,  it  should  be  noted  that  Identity  Map 
Networks  with  three  hidden  layers  are  significantly  more  powerful,  and  will  likely  be  the 
architecture  of  choice  for  real-life  problems.  The  techniques  described  here  readily  scale  up  to 
three  hidden  layer  Identity  Map  Networks. 

Considering  the  information  flow  through  the  Identity  Map  Network,  it  can  be  seen  that  the 
hidden  layer  acts  as  an  “information  bottleneck".  As  the  network  learns  to  reconstruct  the  input 
information  at  the  output  layer,  it  is  forced  to  remove  the  redundancy  from  the  original 
representation  of  the  data  and  transmit  only  the  high-information  component  of  the  original  data 
through  the  hidden  layer  information  bottleneck.  It  is  also  forced  to  learn  to  reconstruct  the 
original  data  from  the  information  transmitted  through  the  bottleneck.  Furthermore,  in 
removing  the  redundancy  in  the  data,  the  network  learns  the  relationships  between  the  fields  in 
the  data  set. 

Almost  all  data  in  real-life  databases  contains  a  significant  amount  of  redundancy.  The  Identity 
Map  Network  is  able  to  determine  that  redundancy  and  exploit  it  to  solve  the  missing  value 
prediction  problem. 

Relationships,  Redundancy,  and  Data  Manifolds 

Consider a data set with k variables (equivalently, a database with k fields). Each of the k
variables defines a dimension (coordinate) in a k-dimensional space. Each observation (or
record) in this data set is a point in this k-dimensional space. If any of the variables are related
to one another, there will be redundancy in the data set, since some variables contain
information about others. Thus a relationship in the data is a redundancy. This redundancy will
be exploited as part of the solution of the missing value prediction problem.

When the data set contains relationships, the data points will not occupy the entire
k-dimensional space, but will lie on a lower-dimensional manifold in this space. The term “Data
Manifold” is used to refer to the manifold on which the data points lie. As examples, a curve is
a one-dimensional manifold while a surface is a two-dimensional manifold. In general it takes k
values to specify a point in a k-dimensional space. However, if it is known that the point lies
on a lower-dimensional manifold, say of p dimensions where p < k, it would take only p
values to specify the point. Figure 10 shows an example where k = 2 and p = 1: a
one-dimensional Data Manifold in a two-dimensional space. In Figure 10, it can be seen that only
one parameter is necessary to specify any point on the data manifold.

The Identity Map Network as a Data Manifold Approximator

If the data points lie on a p-dimensional manifold in a k-dimensional space (where p < k), then
the architecture of the Identity Map Network would be k input nodes, p hidden nodes, and k
output nodes. The information bottleneck would allow only p values to pass from the input
side of the net to the output side in order to reconstruct the values of the k variables.

This  network,  in  being  trained  to  reconstruct  the  input  at  the  output,  is  forced  to  learn  the 
mapping  from  the  original  k-dimensional  space  to  the  p-dimensional  Data  Manifold  in  the  first 
half  of  the  network.  This  mapping  to  the  p-dimensional  manifold  is  represented  by  the 
activations  on  the  p  hidden  units.  In  the  second  half  of  the  network  (from  the  hidden  unit 
activation  to  the  output),  it  is  forced  to  learn  the  inverse  mapping  from  the  p-dimensional  Data 
Manifold  to  the  original  k-dimensional  space. 

The  activation  values  of  the  hidden  nodes  of  this  network  constitute  the  values  of  p  derived 
variables  that  parameterize  the  Data  Manifold.  Points  on  the  Data  Manifold  are  completely 
specified  given  values  for  these  p  variables.  The  second  half  of  the  Identity  Map  Network 
allows  the  translation  of  that  specific  manifold  data  point  back  into  a  point  in  the  original  k- 
dimensional  space. 


Figure 10: A One-dimensional Manifold in a Two-dimensional Space.


Determining the Dimension of the Data Manifold

In  general,  the  dimension  of  the  Data  Manifold  is  not  known  a  priori.  The  concept  of  the 
Automatic  Neural  Network  (ANN)  will  be  used  to  automatically  determine  the  optimal  net 
architecture,  and  thus  the  dimension  of  the  Data  Manifold.  A  description  of  the  operation  of  the 
ANN  for  this  problem  is  given  below. 

Assuming  a  k  variable  data  set  is  given,  the  ANN  will  start  with  a  net  architecture  that  has  k 
input  nodes,  one  node  in  the  hidden  layer,  and  k  output  nodes.  This  network  architecture  will 
be  trained  using  the  ANN  Back-Propagation  Learning  Algorithm,  which  will  effectively 
perform  gradient  descent  error  minimization  in  the  parameter  space.  In  attempting  to  transmit 
the  input  information  through  the  single-node  bottleneck  in  the  hidden  layer,  the  network  will 
learn  the  first  (most  important)  “Generalized  Principal  Component”  of  the  data.  This  will  be  the 
single  derived  variable  that  contains  the  most  information  for  the  reconstruction  of  the  input 
data.  Note  that  this  variable  is  derived  in  a  nonlinear  way  from  the  input  data.  Generalized 
Principal  Components  will  be  discussed  in  greater  detail  later. 

Once  the  network  converges  (that  is,  continued  learning  only  leads  to  small  changes  in  the 
mean  squared  error  that  are  below  some  threshold),  learning  on  the  weights  associated  with 
this  part  of  the  network  is  discontinued.  A  new  node  is  then  added  in  the  hidden  layer  along 
with  all  its  associated  weights.  These  new  weights  are  trained  while  keeping  the  weights 
associated  with  the  previous  node(s)  frozen.  This  process  of  adding  nodes  is  continued  until  an 
acceptable  reconstruction  quality  is  achieved.  An  example  of  a  typical  quality  specification 
might  be  a  mean  squared  reconstruction  error  of  less  than  0.0001  scaled  units.  The  exact  value 
of  the  error  would,  of  course,  be  data  set  specific.  When  a  reconstruction  quality  that  meets  or 
exceeds  the  requirements  has  been  achieved,  both  the  optimal  network  architecture  and  the 
dimension of the Data Manifold will have been determined. The dimension of the Data
Manifold is the number of hidden nodes that were needed for good reconstruction.
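The node-by-node growth procedure can be illustrated with a purely linear stand-in. This is a simplification, not the ANN Back-Propagation Learning Algorithm: each added node here is an ordinary (linear) principal direction of the residual, found by power iteration, whereas the proposed network derives its components nonlinearly. Earlier nodes are frozen by removing their contribution from the residual before the next node is fitted. The function name and parameters are hypothetical.

```python
import math
import random


def grow_bottleneck(rows, max_mse, iters=200, seed=0):
    """Add linear hidden nodes one at a time until reconstruction MSE <= max_mse.

    Returns (number_of_nodes, list_of_node_weight_vectors).
    """
    rng = random.Random(seed)
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    # Residual = what the frozen nodes have not yet reconstructed.
    resid = [[r[j] - means[j] for j in range(k)] for r in rows]
    nodes = []

    def mse():
        return sum(v * v for r in resid for v in r) / (n * k)

    while mse() > max_mse and len(nodes) < k:
        w = [rng.gauss(0.0, 1.0) for _ in range(k)]
        length = math.sqrt(sum(v * v for v in w))
        w = [v / length for v in w]
        for _ in range(iters):
            # Power iteration: hidden activations, then the weight update they imply.
            acts = [sum(a * b for a, b in zip(r, w)) for r in resid]
            new = [sum(a * r[j] for a, r in zip(acts, resid)) for j in range(k)]
            norm = math.sqrt(sum(v * v for v in new))
            if norm == 0.0:
                break
            w = [v / norm for v in new]
        nodes.append(w)
        # Freeze this node by subtracting its reconstruction from the residual.
        for r in resid:
            a = sum(x * y for x, y in zip(r, w))
            for j in range(k):
                r[j] -= a * w[j]
    return len(nodes), nodes
```

For data that lies on a plane embedded in four dimensions, this sketch stops after two nodes, which is the dimension of that (linear) Data Manifold.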

Modifications  to  the  Generic  Backpropagation  Learning  Law 

The  Missing  Values  Problem  brings  with  it  two  special  requirements  that  necessitate  minor 
modifications  to  the  Back-Propagation  learning  law.  Firstly,  the  obvious  fact  that  the  network 
must  be  trained  with  missing  values  raises  a  problem,  since  the  Back-Propagation  learning  law 
makes  no  provisions  for  missing  values  in  the  data.  The  proposed  solution  to  this  problem  is  as 
follows: 

•  During  forward  propagation,  the  missing  value  is  set  to  a  random  number.  This  is  for 
the  purpose  of  biasing  the  net  towards  not  using  the  information  in  this  variable.  Each 
time  this  record  is  presented  to  the  network,  the  missing  value  is  set  to  a  different 
random  value. 

•  During  backward  propagation,  the  error  for  the  missing  value  at  the  output  node  is  set 
to  zero,  since  nothing  is  known  about  this  error. 

The second problem is due to a subtle point regarding learning with redundant information.
Given a redundancy in the input data, the network has a large amount of choice regarding how
it compresses that redundancy and exactly what information it passes through the information
bottleneck. For example:

• Consider a two-variable set consisting of X and Y. Assume X and Y are linearly
related. For simplicity consider a completely linear network. The network may choose
to send the value of X, Y, or a linear combination of X and Y through the bottleneck in
order to reconstruct both X and Y on the other side. The learning law will not bias it
toward any of these choices, since all contain identical information. Consequently the
network will make a decision on this point that is predicated upon the initial conditions
of the network, which are random.

• If neither value is missing, it doesn't matter which choice the network makes.
However, suppose the network has chosen to transmit the value of X through the
bottleneck and ignore the value of Y. Then a missing value of X would cause the
network to be unable to reconstruct anything on the other side. This would make it
impossible to predict missing X's from known Y's. However the opposite problem,
that of predicting missing Y's from known X's, would cause the network no difficulty.

• Thus, when the network is given choices regarding what information to transmit
through the bottleneck, it would be preferable to have it generate the transmitted
information by blending the information from all its possible sources. This
can be made to happen by penalizing the network during learning for taking too much
information from any one source. Specifically this means adding to the error term for
the network a penalty proportional to the sum of the squares of the weights in the
pre-bottleneck layer. The network, in minimizing the total error, will attempt to drive down
the values of large weights if this can be done without causing a deterioration in the
reconstruction quality. These modifications to the learning law will result in the desired
network behavior during training.
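The modifications described above can be sketched as a single training step for a small k-p-k network. This is an illustrative sketch under assumptions, not the proposed implementation: it uses tanh hidden units with linear outputs, marks missing values with `None`, draws the random stand-in from the range [-1, 1], and uses hypothetical names and default values for the learning rate `lr` and the penalty coefficient `decay`.

```python
import math
import random


def train_step(w_in, w_out, record, rng, lr=0.05, decay=1e-3):
    """One gradient step on one record; weights are updated in place.

    w_in holds p rows of k pre-bottleneck weights, w_out holds k rows of p weights.
    Returns the output errors that were used for the step.
    """
    k, p = len(record), len(w_in)
    # Forward pass: a fresh random number stands in for each missing input.
    x = [rng.uniform(-1.0, 1.0) if v is None else v for v in record]
    h = [math.tanh(sum(w_in[a][j] * x[j] for j in range(k))) for a in range(p)]
    y = [sum(w_out[j][a] * h[a] for a in range(p)) for j in range(k)]
    # Backward pass: the error at an output whose target is missing is set to zero.
    err = [0.0 if record[j] is None else y[j] - record[j] for j in range(k)]
    delta_h = [
        (1.0 - h[a] ** 2) * sum(err[j] * w_out[j][a] for j in range(k))
        for a in range(p)
    ]
    for j in range(k):
        for a in range(p):
            w_out[j][a] -= lr * err[j] * h[a]
    for a in range(p):
        for j in range(k):
            # decay * w is the gradient of the penalty (decay / 2) * w**2.
            w_in[a][j] -= lr * (delta_h[a] * x[j] + decay * w_in[a][j])
    return err
```

Because the error at a missing output is zero, the weights feeding that output are left unchanged by the record, while the penalty term still shrinks the pre-bottleneck weights on every step.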

Predicting  Missing  Values 

For  the  rest  of  this  discussion,  assume  a  trained  Identity  Map  Network  with  an  optimal 
architecture  is  available.  The  trained  network  will  have  discovered  all  the  relationships  and 
redundancy  in  the  data.  The  network  will  be  able  to  use  this  information  to  predict  the  missing 
values  in  the  input  data. 

Assume there are k variables, and for some data point, only k-1 values for these variables are
available, that is, one value is missing. Suppose variable i (where i is between 1 and k,
inclusive) is the one whose value is unknown.

The  following  procedure  is  then  used: 

• Substitute a random value for the unknown variable (call it y0) and cycle the data point
through the trained network. This will provide a prediction of the unknown variable
which will be called y1.

• Substitute y1 as the i'th variable in the data point, and cycle the data point again through
the network.

• Continue this process, obtaining a sequence of predictions {y0, y1, y2, ...} until the
change in the predicted value between iterations approaches zero.
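The iteration above can be sketched as follows. The trained network is represented by any function that maps a full record to its reconstruction. As a stand-in for a trained Identity Map Network, the example uses a linear projection onto a one-dimensional manifold (the line y = 2x) in a two-dimensional space. The function names, the tolerance, and the iteration limit are assumptions for the example.

```python
import math


def predict_missing(reconstruct, record, i, start, tol=1e-6, max_iter=200):
    """Feed the latest prediction back in at position i until it stops changing."""
    current = list(record)
    current[i] = start  # y0
    for _ in range(max_iter):
        new_value = reconstruct(current)[i]
        change = abs(new_value - current[i])
        current[i] = new_value  # only the unknown value is ever updated
        if change < tol:
            break
    return current[i]


def line_projector(direction):
    """Linear k-1-k identity map: project onto the line spanned by `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]

    def reconstruct(v):
        coord = sum(a * b for a, b in zip(v, unit))  # the single hidden activation
        return [coord * u for u in unit]

    return reconstruct


reconstruct = line_projector([1.0, 2.0])
# x is known to be 1.0; y is missing and the iteration starts from an arbitrary value.
estimate = predict_missing(reconstruct, [1.0, None], i=1, start=-3.0)
print(round(estimate, 3))
```

In this linear example each pass moves the estimate a fixed fraction of the way toward the point on the line that agrees with the known x, so the sequence settles near y = 2.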

Figure 11 shows a sequence of predictions for a typical case of Missing Value Prediction with
a one-dimensional manifold in a two-dimensional space.


Figure 11: Missing Value Prediction


Note  that  all  the  known  values  are  fed  into  the  network  without  change  at  every  iteration.  Only 
the  unknown  value  (which  we  are  trying  to  predict)  is  being  updated  at  every  iteration  with  the 
output  from  the  network  at  the  previous  iteration.  This  sequence  has  been  observed  in  practice 
to  always  converge  to  four  decimal  places  within  50  iterations.  Convergence  has  been  proved 
theoretically  for  the  linear  case.  If  the  network  was  trained  to  a  sufficient  degree  of  accuracy, 
the sequence will converge to a value which is a good prediction of the variable on the basis of
the values of the other variables in the data set.

If  the  variable  with  the  missing  value  turned  out  to  be  one  which  did  not  have  any  redundant 
representations  in  the  rest  of  the  data  set,  then  the  sequence  will  converge  immediately  on  the 
random  value  that  was  chosen  initially.  This  is  a  rare  case  but  is  possible.  To  deal  with  this 
case  where  the  data  set  does  not  contain  the  information  needed  to  predict  the  missing  variable, 
it  would  be  appropriate  to  begin  the  iterations  with  the  mode  (the  most  probable  value)  of  the 
missing  variable  across  the  data  set,  instead  of  a  randomly  chosen  value.  Then,  even  if  no  extra 
information  is  available,  the  predicted  value  will  be  the  maximum  likelihood  estimate  of  the 
missing  variable. 

Predicting  Multiple  Missing  Values  in  the  Same  Record 

The  same  approach  may  be  used  to  predict  multiple  missing  values  in  the  same  record 
simultaneously.  The  known  values  are  fed  into  the  net  at  every  iteration  along  with  the  latest 
prediction  of  the  unknown  values.  The  natural  question  that  arises  is  that  of  the  degradation  of 
performance  as  the  number  of  simultaneous  missing  values  increases.  That  is,  how  many 
missing  values  can  be  tolerated  simultaneously?  The  answer  to  this  question  lies  in  concepts  of 
Generalize^.  Principial  Component  Analysis,  which  will  be  discussed  in  a  later  section. 
Neverth^^s  a  brief  answer  can  be  given  as  follows. 

v'lYiO. 
Missing values are predicted based on redundant information carried in other variables. If all
the variables that could predict the missing variable are also missing, the prediction will not be
accurate. But if some of the required variables are present to predict each of the missing values,
the prediction will be accurate. In terms of Generalized Principal Component Analysis, the
prediction of any particular variable will depend on the presence of at least one other variable
along its Generalized Principal Component in the record.
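
The iteration for one or several missing values can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: `reconstruct` stands in for one feed-forward pass of a trained Identity Map Network, and `toy_reconstruct` is a hypothetical substitute (it maps any record onto the line where all variables are equal) used only to make the example runnable. The names `column_mode` and `impute` are likewise illustrative.

```python
from collections import Counter

def column_mode(records, j):
    """Most frequent known value of variable j across the data set."""
    known = [r[j] for r in records if r[j] is not None]
    return Counter(known).most_common(1)[0][0]

def impute(record, reconstruct, init, n_iter=200, tol=1e-9):
    """Iteratively predict every missing (None) entry of one record.

    Known values are fed in unchanged at every iteration; only the
    missing entries are replaced by the latest network output.
    """
    missing = [j for j, v in enumerate(record) if v is None]
    x = [init[j] if v is None else v for j, v in enumerate(record)]
    for _ in range(n_iter):
        y = reconstruct(x)
        change = max((abs(y[j] - x[j]) for j in missing), default=0.0)
        for j in missing:
            x[j] = y[j]
        if change < tol:
            break
    return x

# Hypothetical stand-in for a network trained on data where X = Y = Z:
# it returns the point on the line x = y = z closest to the input.
def toy_reconstruct(x):
    m = sum(x) / len(x)
    return [m] * len(x)

records = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [2.0, None, 2.0]]
init = [column_mode(records, j) for j in range(3)]   # start from the mode
filled = impute(records[2], toy_reconstruct, init)
```

In this toy case the missing entry converges toward the value of its two known neighbors. Starting from the mode rather than a random value only matters when the other variables carry no information about the missing one.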

Other  Applications  of  This  Technique 

Another use of the Identity Map approach of modeling relationships and exploiting redundancy
lies in the fact that a trained Identity Map Network contains within it an estimate of the
Generalized Principal Components of the input data. This approach to data compression can be
used  to  reduce  the  size  of  the  input  variable  set  for  modeling  without  losing  valuable  modeling 
information  in  the  process.  This  will  be  discussed  in  detail  in  the  section  entitled  “Generalized 
Principal  Components”. 

Summary  of  Missing  Value  Prediction 

The approach outlined above for solving the missing value prediction problem is a novel
technique that can effectively address the “cleaning” of large databases. Additionally, since this
technique is closely coupled with the solution to the Data Redundancy Removal problem, the
Generalized Principal Components and the associated reduced-dimensionality feature vectors
are available as a collateral advantage.


Bad  Data  Detection 

Motivation  for  Bad  Data  Detection 

Databases  often  contain  incorrect  data.  This  occurs  for  many  reasons,  including  typographical 
errors  during  data  entry  and  misinterpretation  of  the  meaning  of  database  fields.  If  left  in  the 
training  set  for  the  neural  network,  these  “bad  records”  can  significantly  degrade  the 
performance  of  the  model.  As  mentioned  before,  the  performance  of  the  model  is  only  as  good 
as  the  data  from  which  it  is  built.  Eliminating  erroneous  records  in  the  database  will 
significantly  improve  model  performance. 

A  formal  definition  of  the  bad  record  identification  problem  can  be  stated  as:  “Given  a  data  set 
containing  some  bad  records,  determine  a  subset  of  the  records  which  have  fewer  than  a 
specified  percentage  of  bad  records.  Furthermore,  given  a  specific  record,  classify  it  as  good 
or  bad  along  with  an  estimate  of  the  confidence  of  classification.” 

The  following  sections  detail  the  proposed  approach  to  solving  the  Bad  Data  Detection 
problem.  The  discussion  in  these  sections  first  defines  what  is  meant  by  “Bad  Data”.  It  then 
continues  with  a  discussion  of  Robust  Estimation.  It  proceeds  with  a  description  of  how  to 
obtain a robust variation of the Back-Propagation Learning Law. Finally, it describes how a
Robust  Identity  Map  Network  can  be  used  to  solve  the  Bad  Data  Detection  problem. 

Definition  of  Bad  Data 

It is important to define what is meant by the term “bad data” in this context. It is not possible
to  look  at  a  piece  of  data  and  state  definitively  whether  it  is  bad  or  not.  Nevertheless,  it  is 
feasible  to  determine  whether  it  is  a  common  or  uncommon  observation.  That  is,  one  can 
determine  if  there  are  many  other  similar  observations  in  the  data  set  or  if  this  is  a  unique 
observation.  The  term  “unique”  is  used  loosely  to  mean  that  there  are  no  observations  close 
to the data point under consideration. A more precise (but perhaps less intuitive) term for a
unique  data  point  would  be  an  outlier,  a  data  point  that  lies  far  from  the  mainstream. 

If  an  observation  is  unique  in  the  data  set,  it  may  be  an  error  or  it  may  be  a  rare  but  correct 
observation.  If  it  is  an  error,  it  should  be  eliminated.  If  it  is  rare  but  correct,  it  should  still  be 
eliminated  because  conclusions  made  on  small  numbers  of  observations  are  not  statistically 
valid.  Thus  it  is  fair  to  say  that  observations  that  are  unique  or  occur  very  infrequently  should 
be  dropped  from  the  data  set.  Such  observations  are  termed  “bad  data”. 

Concept  of  Robust  Estimation 

Standard  Back-Propagation  as  a  Least-squares  Estimator 

The standard Back-Propagation Learning Law for neural networks performs gradient
descent, iteratively improving the network weights to minimize the squared error of the
network across the training set. Since the function being minimized is the squared error, the
optimality condition is the least-squares criterion. Optimization with respect to the
least-squares criterion provides maximum likelihood estimates of the neural network weights in
the  case  that  the  error  has  a  Gaussian  distribution.  Bad  records  in  the  database  are 
problematic  to  the  modeling  process  if  they  create  outliers,  which  are  data  points  lying  far 
from  the  mainstream.  If  there  are  outliers  in  the  data,  the  error  in  the  data  will  not  have  a 
Gaussian  distribution.  Thus,  if  the  optimality  condition  is  the  least-squares  criterion, 
outliers  will  cause  the  network  to  deviate  from  the  maximum  likelihood  estimates  of  the 
weights. 

Gaussian  Distribution  of  Error 

An  approach  to  solving  this  problem  lies  in  using  something  other  than  the  least-squares 
criterion  as  the  optimality  condition.  Specifically,  do  not  assume  the  errors  to  have  a 
Gaussian  distribution.  With  a  Gaussian  distribution  of  error,  the  probability  of  large  errors 
decreases super-exponentially with the absolute value of the error. Thus if the network
detects  a  large  error,  it  decides  it  to  be  more  probable  that  its  weights  are  wrong  than  that 
the  data  point  is  wrong  (since  the  probability  that  the  data  point  is  so  wrong  is  super- 
exponentially  small).  Hence  it  makes  a  large  correction  in  its  weights  in  the  direction  of  the 
data  point  with  the  large  error.  The  correction  in  the  weights  is  proportional  to  the  size  of 
the  error.  As  the  size  of  the  error  goes  to  infinity,  the  network’s  response  increases  without 
bound.  In  a  sense,  the  network  acts  “gullible”  in  the  presence  of  outliers  when  using  the 
standard  learning  law  because  it  blindly  believes  that  every  piece  of  data  it  sees  is  correct. 

Figure 12: Response of the Network vs. Size of the Error Assuming a Gaussian Distribution


If  the  error  is  assumed  to  have  a  fat-tailed  distribution,  such  as  a  Cauchy  distribution,  the 
network  will  behave  differently.  If  it  encounters  a  data  point  that  causes  a  large  error,  it 
decides that the data point is more likely to be wrong than right. Hence it makes practically
no  correction  in  its  weights  when  it  encounters  large  errors.  This  leads  to  behavior  that  can 
be  considered  “skeptical”  in  the  presence  of  outliers.  However,  this  kind  of  behavior  can 
sometimes  prevent  the  network  from  learning  correctly  at  all. 


Figure  13:  Response  of  Network  vs.  Size  of  the  Error  Assuming  a  Cauchy  Distribution 

Logistic Distribution of Error


A  compromise  between  the  above  two  behaviors  can  be  obtained  by  assuming  the  error  to 
have  a  Logistic  distribution.  With  a  learning  law  based  on  this  assumption,  the  network 
will  react  similarly  to  moderate  as  well  as  large  errors.  Hence  this  kind  of  behavior  can  be 
considered to be neither “gullible” nor “skeptical”, but “open-minded” in the presence of
outliers.  Using  this  approach,  the  network  will  be  influenced  by  all  points,  but  will 
gravitate  towards  those  that  appear  consistently. 

Assuming  a  Logistic  distribution  of  error,  the  reaction  of  the  network  to  large  errors  will 
not  be  proportional  to  the  error,  but  to  the  hyperbolic  tangent  of  the  error.  The  hyperbolic 
tangent  function  lies  in  the  range  -1  to  1  everywhere  and  approaches  the  limits  -1  and  +1  at 
minus  and  plus  infinity  respectively.  Thus  no  error,  however  large,  will  cause  the  network 
to be influenced by more than a constant amount.


Figure  14:  Response  of  the  Network  vs.  Size  of  the  Error  Assuming  a  Logistic  Distribution 
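
The three behaviors can be compared by writing the error term that drives the weight update as the derivative of the negative log-likelihood of the assumed error density. The derivation below is a sketch under unit-scale densities; the symbols $\varepsilon$ (error), $p$ (assumed density), and $\psi$ (network response) are notational assumptions that do not appear in the report.

```latex
\psi(\varepsilon) \;=\; -\frac{d}{d\varepsilon}\,\ln p(\varepsilon)

\begin{aligned}
\text{Gaussian:}\quad & p(\varepsilon) \propto \exp\!\left(-\tfrac{1}{2}\varepsilon^{2}\right)
  && \Rightarrow\quad \psi(\varepsilon) = \varepsilon
  && \text{(unbounded)} \\
\text{Cauchy:}\quad & p(\varepsilon) \propto \frac{1}{1+\varepsilon^{2}}
  && \Rightarrow\quad \psi(\varepsilon) = \frac{2\varepsilon}{1+\varepsilon^{2}}
  && \text{(tends to } 0 \text{ as } |\varepsilon| \to \infty\text{)} \\
\text{Logistic:}\quad & p(\varepsilon) = \frac{\exp(-\varepsilon)}{\left(1+\exp(-\varepsilon)\right)^{2}}
  && \Rightarrow\quad \psi(\varepsilon) = \tanh\!\left(\tfrac{\varepsilon}{2}\right)
  && \text{(tends to } \pm 1\text{)}
\end{aligned}
```

With a scale parameter $s$ in the Logistic density the response becomes $\tfrac{1}{s}\tanh\!\left(\tfrac{\varepsilon}{2s}\right)$, so the hyperbolic tangent of the error itself corresponds to one particular choice of scale, up to a constant factor.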


This is the recommended modification to the Back-Propagation learning law for data
containing  outliers.  In  the  following  section  this  is  used  to  perform  robust  estimation  of  the 
Data  Manifold.  The  term  “Robust  Back-Propagation”  is  used  to  refer  to  Back-Propagation 
with  the  hyperbolic  tangent  function  applied  to  the  error  term. 
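
The effect of applying the hyperbolic tangent to the error term can be illustrated on the smallest possible model, a single weight `m` that predicts every data value. This is a sketch, not the report's learning law for a full network: the function `fit_location`, the learning rate, and the toy data are assumptions chosen for illustration.

```python
import math

def fit_location(data, psi, lr=0.1, steps=500):
    """Fit a single weight m by repeated gradient-style updates.

    At each step the error of every point is x - m, and m moves by
    lr times the mean of psi(error). psi is the response applied to
    the error term by the learning law.
    """
    m = 0.0
    for _ in range(steps):
        m += lr * sum(psi(x - m) for x in data) / len(data)
    return m

# Five mainstream observations near zero and one gross outlier.
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 100.0]

standard = fit_location(data, psi=lambda e: e)   # response proportional to error
robust = fit_location(data, psi=math.tanh)       # response bounded by tanh
```

With the proportional response the weight settles at the arithmetic mean and is dragged far from the mainstream by the single outlier. With the bounded response the outlier contributes at most a constant pull, so the weight stays close to the five consistent points.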

It should be noted that there is an expected tradeoff in the speed of convergence of Robust
Back-Propagation in exchange for the property of robustness. This is due to the fact that the
weights will not take large steps in response to large errors, and thus will need more
iterations with smaller steps at each iteration in order to converge the same amount. Given
that the ADAM Workstation will be based on the high throughput SNAP hardware, this is
not expected to be a concern.

Robust Estimation of the Data Manifold

Previously,  estimation  of  the  Data  Manifold  using  the  Identity  Map  Network  was  described. 
That  method  used  the  standard  Back-Propagation  learning  law  to  train  the  Identity  Map 
Network.  In  order  to  determine  a  robust  estimate  of  the  Data  Manifold,  the  only  change 
necessary  is  the  use  of  the  Robust  Back-Propagation  learning  law  instead  of  the  standard  Back- 
Propagation  learning  law.  This  network  is  termed  the  “Robust  Identity  Map  Network”. 

Using  the  Robust  Identity  Map  Network  for  Bad  Data  Detection 

Given the Robust Identity Map Network, the problem of Bad Data Detection can be
addressed as follows:

•  Take  a  data  point  and  feed  it  into  the  trained  network. 

•  Cycle the net once in feed-forward mode and observe the output.

•  Compute the error of reconstruction.

•  If the error of reconstruction is above a threshold value, declare the data point to be
erroneous.

•  Otherwise  declare  the  data  point  to  be  acceptable. 

The  concept  is  that  since  the  network  has  been  trained  to  reconstruct  points  on  the  mainstream 
of  the  Data  Manifold  correctly,  badly  reconstructed  points  do  not  lie  on  the  Data  Manifold.  This 
concept is consistent with the basic ideas of Robust Estimation, where a large error is treated as
evidence  that  a  data  point  may  be  an  outlier,  and  thus  bad  data.  Since  the  network  was  trained 
using  Robust  Back-Propagation,  this  approach  would  be  valid  even  for  the  data  on  which  the 
network  was  trained.  That  is,  one  could  expect  large  errors  from  outliers  in  the  training  data, 
which  would  not  be  the  case  if  ordinary  Back-Propagation  had  been  used. 

The  size  of  the  error  relative  to  the  good/bad  decision  threshold  provides  a  confidence  measure 
for  the  classification  decision.  In  order  to  build  a  subset  of  the  records  with  fewer  than  a 
specified  percentage  of  bad  records,  one  would  merely  have  to  adjust  the  threshold 
appropriately  and  extract  the  records  that  were  classified  as  “good”  from  the  data  set. 
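
The detection procedure and the threshold adjustment can be sketched as follows. This is an illustration under assumptions: `reconstruct` stands in for a feed-forward pass of the trained Robust Identity Map Network, `toy_reconstruct` is a hypothetical substitute for data lying near the line x = y, and expressing the confidence as the signed difference between error and threshold is one possible reading of "size of the error relative to the threshold".

```python
def reconstruction_error(x, reconstruct):
    """Mean squared difference between a record and its reconstruction."""
    y = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def threshold_for_fraction(errors, reject_fraction):
    """Pick a threshold so that about reject_fraction of records exceed it."""
    ranked = sorted(errors)
    keep = max(1, int(round(len(ranked) * (1.0 - reject_fraction))))
    return ranked[keep - 1]

def classify(records, reconstruct, threshold):
    """Return (label, error, margin) for each record.

    margin = error - threshold (assumed confidence measure): large
    positive values are confidently bad, large negative confidently good.
    """
    results = []
    for x in records:
        err = reconstruction_error(x, reconstruct)
        label = "bad" if err > threshold else "good"
        results.append((label, err, err - threshold))
    return results

# Hypothetical stand-in for a network trained on data where x is close to y.
def toy_reconstruct(x):
    m = sum(x) / len(x)
    return [m] * len(x)

records = [[1.0, 1.2], [2.0, 1.9], [3.0, 3.1], [4.0, 4.0], [1.0, 5.0]]
errors = [reconstruction_error(r, toy_reconstruct) for r in records]
threshold = threshold_for_fraction(errors, reject_fraction=0.2)
labels = [label for label, _, _ in classify(records, toy_reconstruct, threshold)]
good_subset = [r for r, lab in zip(records, labels) if lab == "good"]
```

Raising `reject_fraction` lowers the threshold and yields a smaller but cleaner subset of records.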

Summary  of  Bad  Data  Detection 

This  approach  to  the  Bad  Data  Detection  problem  can  be  used  to  “clean”  existing  databases. 
Additionally,  this  modification  to  the  basic  backpropagation  learning  law  could  be  applied 
directly  to  modeling  in  order  to  build  better  quality  models. 


Data  Redundancy  Removal 

Motivation  for  Data  Redundancy  Removal 

The  problem  of  Data  Redundancy  Removal  arises  in  model  reduction,  where  it  is  desired  to 
determine  a  significantly  reduced  subset  of  the  variables  of  a  data  set  that  is  capable  of 
prediction  performance  comparable  to  the  full  variable  set.  Model  reduction  is  of  particular 
importance  in  situations  where  the  cost  of  collecting  and/or  processing  the  data  is  high  relative 
to  the  returns.  Data  Redundancy  Removal  is  also  essential  where  post-modeling  explanation  is 
involved, since valid explanation requires the input to the model to be non-redundant, i.e.,
reduced. If there is redundancy in the input data, the network has considerable freedom
in how it builds the model from that redundant information. For example:


•  Consider  a  three-variable  set  consisting  of  X,  Y,  and  Z.  Assume  that  X=Y=Z.  For 
simplicity  consider  a  linear  network  model  with  Z  as  the  output.  In  other  words,  the 
network  is  trying  to  predict  Z  on  the  basis  of  X  and  Y. 

•  The  network  may  choose  to  predict  Z  using  X,  Y,  or  a  linear  combination  of  X  and  Y. 
The learning law will not bias it either way since X and Y contain identical information.
Consequently, the network will make a random decision on this point.

•  After  network  training  is  completed,  the  next  phase  is  typically  network  explanation 
using  Sensitivity  Analysis  for  aggregated  information  or  Knowledge-Net  for  record- 
specific  explanation.  Here,  partial  derivatives  of  the  network  output  with  respect  to  its 
input  are  computed. 

•  Conceptually,  the  explanation  results  give  the  importance  of  each  input  in  determining 
the  output.  In  this  case,  the  network  may  have  randomly  chosen  to  predict  Z  using  X, 
Y,  or  a  linear  combination  of  X  and  Y.  In  each  of  these  cases,  the  explanation  results 
will  be  different.  In  each  case  the  explanation  will  be  correct,  but  none  will  provide  a 
complete  explanation  of  the  relationships  between  the  variables. 
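
The X = Y = Z example above can be reproduced with a small experiment. This is a sketch using plain gradient descent on a two-weight linear model; the function name `train_linear`, the learning rate, and the seeds are assumptions made for illustration.

```python
import random

def train_linear(samples, seed, lr=0.05, epochs=2000):
    """Fit z ~ wx*x + wy*y by gradient descent on the squared error,
    starting from random weights determined by the seed."""
    rng = random.Random(seed)
    wx, wy = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(epochs):
        gx = gy = 0.0
        for x, y, z in samples:
            err = z - (wx * x + wy * y)
            gx += err * x
            gy += err * y
        wx += lr * gx / len(samples)
        wy += lr * gy / len(samples)
    return wx, wy

samples = [(t, t, t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]   # X = Y = Z
runs = [train_linear(samples, seed) for seed in (1, 2, 3)]
```

Every run predicts Z perfectly because the weights sum to one, yet the individual weights, which are the partial derivatives of the output with respect to X and Y, differ from run to run. A sensitivity-based explanation would therefore report a different importance for X and Y each time.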

The  example  above  shows  that  redundancy  in  the  input  data  can  confuse  the  results  of  neural 
network  explanation.  The  statistical  technique  of  Principal  Component  Analysis  is  capable  of 
removing  redundancy  caused  by  linear  correlations  in  the  data.  However,  there  is  further 
redundancy  caused  by  nonlinear  relationships  in  the  data.  For  reliable  operation  of  the 
explanation  mechanism,  this  redundancy  must  be  removed  also.  Data  Redundancy  Removal 
provides  the  solution  to  this  problem. 

More  formally,  the  redundancy  elimination  problem  can  be  stated  as:  “Given  a  data  set 
consisting  of  k  variables,  determine  a  minimal  subset  of  the  variables  that  contains  the 
information present in the entire variable set, and use it to reconstruct the entire variable set to
within  a  certain  mean  squared  error  tolerance.” 

Concepts  of  Data  Redundancy  Removal 

Data  Redundancy  Removal  can  be  viewed  as  a  generalization  of  the  traditional  statistical 
method  of  Principal  Component  Analysis.  To  lay  the  groundwork  for  Data  Redundancy 
Removal,  the  concepts  of  Principal  Component  Analysis  are  described  here. 

Basic  Concepts  of  Principal  Component  Analysis 

Given  a  data  set,  one  may  compute  its  correlation  matrix,  a  square  matrix  whose  entries 
represent  the  strength  of  the  linear  relationships  between  the  variables.  Then,  one  may 
perform an eigen-analysis on this correlation matrix. The eigenvectors of the correlation
matrix  are  termed  the  Principal  Components  of  the  data  set.  These  Principal  Components 
are  derived  variables  that  contain  information  about  redundancy  in  the  input  data  caused  by 
linear  correlation.  It  should  be  noted  that  these  derived  variables  are  related  to  the  original 
variables  through  purely  linear  relationships. 

The  Principal  Components  are  uncorrelated  with  respect  to  one  another  and  thus  form  a 
(linearly)  non-redundant  set  that  could  be  used  to  completely  reconstruct  the  input  data.  The 
eigenvalues  of  the  correlation  matrix  provide  a  measure  of  the  amount  of  reconstruction 
information  content  associated  with  each  eigenvector.  Specifically,  the  largest  eigenvalues 
are associated with the derived variables carrying the most information for reconstruction.

In order to specify the input data in a non-redundant way, it would be possible to specify the
derived variables associated with the eigenvectors having the largest eigenvalues and drop
the rest of the derived variables. A specified reconstruction error tolerance would determine
how many derived variables (Principal Components) to retain. No information beyond the
acceptable error tolerance would be lost in this process.
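
As a worked illustration, consider the smallest case of two standardized variables with correlation $r > 0$. The symbols $R$, $\lambda_i$, and $v_i$ are notational assumptions introduced here and are not used in the report.

```latex
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
\lambda_1 = 1 + r,\quad v_1 = \tfrac{1}{\sqrt{2}}\,(1,\; 1)^{\top}, \qquad
\lambda_2 = 1 - r,\quad v_2 = \tfrac{1}{\sqrt{2}}\,(1,\; -1)^{\top}

\frac{\lambda_2}{\lambda_1 + \lambda_2} \;=\; \frac{1 - r}{2}
```

The second expression is the fraction of the total variance lost when only the first Principal Component is retained. For $r = 0.9$ this is 5 percent, so a single derived variable suffices whenever the reconstruction tolerance allows that much error.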

One Approach to Generalizing Principal Component Analysis

Linear  Principal  Component  Analysis  is  capable  of  removing  redundancy  caused  by  linear 
correlations  in  the  data.  Further  redundancy  can  be  removed  by  considering  the  nonlinear 
relationships  in  the  data.  This,  however,  must  be  done  with  care,  as  the  following 
discussion  illustrates. 

Principal  Component  Analysis  uses  an  eigen-analysis  of  the  correlation  matrix  in  order  to 
determine  a  set  of  derived  variables  that  provide  a  linearly  non-redundant  specification  of 
the  input  data  to  within  a  certain  tolerance.  Relationship  Discovery  provides  a  nonlinear 
equivalent  of  the  correlation  matrix  called  the  Relationship  Matrix.  A  natural  approach  that 
suggests  itself  is  to  perform  an  eigen-analysis  of  the  Relationship  Matrix  to  determine  a  set 
of  derived  variables  that  are  non-redundant  with  respect  to  linear  as  well  as  nonlinear 
relationships. 

However,  a  subtle  point  arises  here.  When  all  the  redundancy  that  is  removed  is  caused  by 
linear  relationships  in  the  data,  the  input  data  can  be  directly  reconstructed  from  the  non- 
redundant  representation  with  a  linear  transformation.  However,  if  the  redundancy  that  is 
removed  is  caused  by  a  nonlinear  relationship  in  the  data,  one  must  be  careful  to  ensure  that 
the  reconstruction  is  feasible.  In  particular,  while  all  the  linear  functions  represented  in  the 
correlation  matrix  are  invertible,  the  nonlinear  functions  represented  in  the  relationship 
matrix are often not invertible.

For example, it is possible to predict X^2 uniquely from X. However, it is not possible to
predict X uniquely from X^2. Thus, when variables are highly related it is not always
possible  to  predict  an  arbitrarily  chosen  one  from  another.  Hence,  one  must  use  an 
approach  which  guarantees  that  reconstruction  of  the  original  data  from  the  reduced 
representation  is  possible.  This  approach  is  described  below. 

The Identity Map Network for Data Redundancy Removal

The  Identity  Map  Network  described  earlier  for  Missing  Value  Prediction  computes  what 
may be termed the “Generalized Principal Components” of the data set. Furthermore, it
guarantees  that  reconstruction  is  possible  by  actually  performing  the  reconstruction.  Thus 
the  Identity  Map  approach  is  a  viable  way  of  performing  Data  Redundancy  Removal.  The 
details  are  described  in  the  following  section. 

Data  Redundancy  Removal  using  the  Identity  Map  Network 

The  Identity  Map  Network  is  built  as  follows: 

•  Assume  a  k-variable  data  set  is  given.  Start  with  a  net  architecture  having  k  input 
nodes,  one  node  in  the  hidden  layer,  and  k  output  nodes.  Train  this  network  using  the 
Robust Back-Propagation Learning Algorithm.

•  When  the  mean  squared  error  of  the  network  converges,  stop  training  on  the  weights 
associated  with  this  part  of  the  network. 

•  Then add a new node in the hidden layer along with all its associated weights. Train
the  weights  associated  with  this  new  node,  while  keeping  the  weights  associated  with 
the  first  node  constant. 

•  Continue  the  process  of  adding  new  nodes  until  an  acceptable  reconstruction  quality  is 
achieved. 
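
The node-by-node construction above can be sketched as follows. This is a simplified illustration, not the report's algorithm: it uses linear units and plain gradient descent in place of the Robust Back-Propagation Learning Algorithm, and the names `train_node`, `grow_identity_map`, the learning rate, and the toy data are all assumptions. Each new hidden node is trained on the residual left by the earlier nodes, whose weights stay frozen.

```python
import random

def train_node(inputs, residuals, rng, lr=0.1, epochs=3000):
    """Train one new hidden node: encoder weights a, decoder weights b.

    The node sees the original inputs and learns to reproduce the part
    of the data that the frozen earlier nodes failed to reconstruct.
    """
    k, n = len(inputs[0]), len(inputs)
    a = [rng.uniform(-0.1, 0.1) for _ in range(k)]
    b = [rng.uniform(-0.1, 0.1) for _ in range(k)]
    for _ in range(epochs):
        ga, gb = [0.0] * k, [0.0] * k
        for x, r in zip(inputs, residuals):
            h = sum(ai * xi for ai, xi in zip(a, x))        # hidden activation
            e = [ri - h * bi for ri, bi in zip(r, b)]       # remaining error
            eb = sum(ei * bi for ei, bi in zip(e, b))
            for j in range(k):
                ga[j] += eb * x[j]
                gb[j] += h * e[j]
        a = [ai + lr * g / n for ai, g in zip(a, ga)]
        b = [bi + lr * g / n for bi, g in zip(b, gb)]
    return a, b

def mse(residuals):
    return sum(sum(v * v for v in r) for r in residuals) / len(residuals)

def grow_identity_map(inputs, tolerance, max_nodes, seed=0):
    """Add hidden nodes one at a time until the error is acceptable."""
    rng = random.Random(seed)
    residuals = [list(x) for x in inputs]
    nodes, errors = [], [mse(residuals)]
    while errors[-1] > tolerance and len(nodes) < max_nodes:
        a, b = train_node(inputs, residuals, rng)
        nodes.append((a, b))                # this node's weights are now frozen
        residuals = [
            [ri - sum(ai * xi for ai, xi in zip(a, x)) * bi
             for ri, bi in zip(r, b)]
            for x, r in zip(inputs, residuals)
        ]
        errors.append(mse(residuals))
    return nodes, errors

# Three variables, two of them redundant (second = 2 * first).
data = [(t, 2.0 * t, s) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)
        for s in (-0.5, 0.0, 0.5)]
nodes, errors = grow_identity_map(data, tolerance=1e-3, max_nodes=3)
```

The list `errors` records the mean squared reconstruction error before any node is added and after each node is trained. For this toy data set two hidden nodes are expected to be enough, reflecting the fact that only two of the three variables carry independent information.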

In  attempting  to  transmit  the  input  information  through  the  information  bottleneck  in  the  hidden 
layer,  the  network  will  learn  derived  variables  that: 

•  Can  reconstruct  the  input  to  the  desired  accuracy. 

•  Do  not  carry  redundant  information. 

•  May  be  derived  in  a  nonlinear  way  from  the  input  data. 

These  are  nothing  more  than  the  Generalized  Principal  Components  of  the  data.  Note  that  the 
derived  variables  are  exactly  the  variables  that  parameterize  the  Data  Manifold.  However, 
though  the  Generalized  Principal  Components  of  the  data  have  been  obtained,  the  problem  of 
Data  Redundancy  Removal  has  not  been  completely  solved.  The  next  section  details  the 
remaining problem and its solution.

Motivation  for  Realigning  Generalized  Principal  Components 

In  Principal  Component  Analysis,  the  use  of  derived  variables  is  an  inconvenience.  It 
would  be  preferable  to  merely  select  a  subset  of  the  input  variables  (rather  than  a  set  of 
derived  variables)  that  could  be  used  to  reconstruct  all  the  rest  of  the  variables.  In  some 
situations  the  cost  of  collecting  the  data  is  high,  so  there  is  even  more  incentive  to  find  a 
small  subset  of  the  variables  that  carries  almost  all  the  information  in  the  full  set.  Another 
important  problem  is  the  difficulty  of  assigning  meaning  to  derived  variables,  which  causes 
explanation  quality  to  suffer.  Thus,  there  is  significant  motivation  to  seek  a  method  of 
selecting  a  subset  of  the  input  variables  that  carry  the  information  content  of  the  full  set. 

The process of finding this subset of the input variables from the Generalized Principal
Components is termed “realigning” the Generalized Principal Components.

Realigning  Linear  Principal  Components 

In  Principal  Component  Analysis,  this  “Realignment”  capability  exists,  and  is  termed 
“rotation”  of  the  Principal  Components.  The  term  arises  for  the  following  reason.  The 
Principal  Components  represent  an  orthogonal  set  of  coordinate  axes  in  the  input  space. 
They  span  almost  all  the  data,  though  not  all  of  the  space.  The  Principal  Components  are 
not in general aligned along the original coordinate axes. Thus, they form a set of variables
derived by a general linear transformation of the input variables, not merely a subset of the
input variables. However, if the Principal Components are rotated within the space they
span  in  such  a  way  that  each  input  variable  has  a  nonzero  projection  on  exactly  one 
Principal  Component,  the  result  is  a  “derived”  variable  set  that  is  actually  a  subset  of  the 
original  variable  set. 

Realigning  Generalized  Principal  Components 

In  the  case  of  Generalized  Principal  Component  Analysis,  the  objective  is  the  same  but  the 
approach  is  different.  A  mere  rotation  cannot  align  the  Generalized  Principal  Components 
with  the  input  variables  because  of  the  nonlinear  relationships  embodied  in  the  Generalized 
Principal  Components.  However,  consider  the  following.  The  first  half  of  the  Identity  Map 
Network is performing a transformation that generates the derived variables, while the
second  half  of  the  Identity  Map  Network  is  regenerating  the  input  variables  from  the 
derived  variables.  Place  a  restriction  on  the  first  half  of  the  network  such  that  only  a 
selection  is  allowed,  and  not  a  general  transformation.  Then,  the  Generalized  Principal 
Components  that  the  network  generates  will  be  necessarily  a  mere  selection  of  the  input 
variables.  The  constraints  on  the  network  necessary  to  achieve  this  are  commonly  used  in 
Linear  Programming  problems,  where  they  are  termed  the  “Assignment  Constraints”.  Here 
they  are  referred  to  as  the  “Selection  Constraints”. 

Use of the “Selection Constraints” transforms the neural network learning process from an
unconstrained optimization problem to a constrained optimization problem. There is a
standard technique used in optimization to deal with this situation: convert the problem back
to  unconstrained  optimization  by  phrasing  the  constraints  as  terms  in  the  objective  function. 
That  is,  make  the  Selection  Constraints  penalty  terms  in  the  Back-Propagation  Error 
function. 
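
One way to phrase such constraints as a penalty term is sketched below. The report does not give the functional form, so everything here is an assumption: the quadratic penalties, the weight matrix layout `W[i][j]` (weight from input j to derived variable i), the multiplier `lam`, and the function names are illustrative only.

```python
def selection_penalty(W):
    """Penalty that is zero exactly when W is a selection matrix.

    A selection matrix has entries that are 0 or 1, exactly one 1 in
    each row (each derived variable copies one input), and at most
    one 1 in each column (no input is selected twice).
    """
    # Entries should be 0 or 1: w * (1 - w) vanishes only at those values.
    binary = sum((w * (1.0 - w)) ** 2 for row in W for w in row)
    # Each row should sum to 1.
    one_per_row = sum((sum(row) - 1.0) ** 2 for row in W)
    # No two derived variables should draw on the same input.
    shared = 0.0
    for col in zip(*W):
        for i in range(len(col)):
            for k in range(i + 1, len(col)):
                shared += (col[i] * col[k]) ** 2
    return binary + one_per_row + shared

def penalized_error(reconstruction_mse, W, lam=1.0):
    """Objective for unconstrained training: error plus weighted penalty."""
    return reconstruction_mse + lam * selection_penalty(W)
```

Increasing `lam` during training pushes the first half of the network toward a pure selection of inputs, at the cost of a harder optimization problem.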

The  Solution  to  the  Data  Redundancy  Removal  Problem 

Thus,  by  modifying  the  Back-Propagation  learning  law  to  penalize  violation  of  the 
Selection  Constraints  for  the  Identity  Map  Network,  the  network  can  be  made  to  select  a 
small  subset  of  its  inputs  to  reconstruct  the  output.  This  provides  a  way  of  building  a 
reduced  set  of  input  variables  that  do  not  contain  redundancy.  This  is  the  solution  to  the 
Data  Redundancy  Removal  problem. 

Summary  of  Data  Redundancy  Removal 

In conclusion, Data Redundancy Removal can be solved using the approach described. The
technique  is  closely  aligned  with  both  Missing  Value  Prediction  and  Bad  Data  Detection.  This 
offers  significant  leverage  in  the  technical  development  cycle  and  significantly  reduces  risk. 

Conclusions and Discussion


Results  of  Phase  I 

All  of  the  Phase  I  technical  objectives  have  been  met  and  alternate  approaches  to  the  relationship 
discovery  process  have  been  developed.  Furthermore,  a  significant  amount  of  refinement  on  the 
relationship  discovery  tool  has  been  performed.  This  has  resulted  in  a  fully  functional  capability 
that  is  currently  in  use  at  HNC.  The  Relationship  Discovery  software  tool  along  with  an  executable 
copy  of  the  HNC  Data  Base  Mining  software  will  be  delivered  to  the  Army  (AIRMICS)  along  with 
this Final Report.

Chi-squared  Relationship  Discovery 

An  automated  Relationship  Discovery  algorithm  based  on  a  variant  of  the  Chi-squared  test  has 
been  designed  and  implemented.  The  Chi-squared  test  is  a  well-known  test  of  statistical 
independence.  In  Relationship  Discovery,  it  is  used  on  the  projected  Probability  Density 
Function.  This  approach  has  been  tested  on  artificial  data  containing  known  relationships,  and 
has  achieved  a  100%  success  rate  in  determining  whether  a  relationship  exists  or  does  not  exist 
in  test  problems. 

Relationship  Discovery  via  the  Chi-squared  test  is  a  theoretically  rigorous  statistically-based 
approach  to  the  problem.  It  detects  all  bivariate  (second  order)  relationships  in  the  data,  and 
detects  the  projections  of  most  multivariate  (higher  order)  relationships  found  in  real-life  data. 
These  detected  relationships  can  then  be  visualized  since  they  are  all  of  low  order. 
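
For reference, the standard Pearson Chi-squared statistic of independence can be computed from a two-dimensional table as sketched below. The report uses a variant of this test applied to the projected Probability Density Function; the details of that variant are not given, so this example uses the textbook statistic on a table of binned counts, together with Cramer's coefficient as a normalized strength value. The function names are assumptions.

```python
import math

def chi_squared(table):
    """Pearson Chi-squared statistic for a 2-D table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Cramer's coefficient: 0 for independence, 1 for a perfect relationship."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi_squared(table) / (n * (k - 1)))

independent = [[10, 20], [20, 40]]   # rows are proportional: no relationship
dependent = [[30, 0], [0, 30]]       # one variable determines the other
```

Because the statistic compares observed cell counts with those expected under independence, it responds to nonlinear and discontinuous relationships as well as linear ones.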

Probing-of-K-space  Relationship  Discovery 

An  alternate  algorithm  has  also  been  designed  and  implemented.  This  is  termed  “Probing  of  K- 
space”,  and  is  based  on  the  idea  of  determining  which  other  variables  change  when  one 
variable  changes  in  the  projected  Probability  Density  Function.  This  technique  has  been  tested 
on  artificial  data  containing  known  relationships,  and  has  been  able  to  rank  the  strengths  of 
relationships  among  the  variables  successfully.  This  approach  requires  the  user  to  specify  a 
cutoff  strength  value  below  which  relationships  will  be  deemed  to  be  insignificant. 

The  Probing  of  K-space  approach  is  an  alternate  heuristic  approach  to  the  Relationship 
Discovery  problem.  It  is  more  susceptible  to  noise  than  the  Chi-squared  test  approach,  and 
also  is  more  computation-intensive.  However,  it  has  been  successful  in  ranking  relationships  in 
order  of  decreasing  strength  in  test  cases. 

Testing 

Both  of  the  above-mentioned  approaches  have  also  been  tested  on  real  data  obtained  from 
civilian  customers  of  HNC.  The  results  in  these  cases  have  been  positive,  since  the  algorithms 
captured  many  expected  and  intuitively  obvious  relationships  in  the  data,  as  well  as  some  that 
were  not  as  intuitively  obvious.  The  strongest  relationships  that  fell  in  the  latter  category  were 
inspected visually and it was confirmed that they did exist.

Relationship Aggregation

An  algorithm  for  Relationship  Aggregation  has  been  designed  and  implemented.  This 
algorithm  derives  a  measure  of  the  “distance”  between  pairs  of  variables  from  the  previously 
obtained  relationship  strengths,  and  treats  them  as  weights  in  a  complete,  weighted,  undirected 
graph.  The  algorithm  then  finds  the  shortest  path  between  variables  in  this  graph,  builds  a  tree 
of  such  shortest  paths,  and  presents  this  tree  to  the  user.  This  tree  captures  the  most  relevant 
relationships  in  the  data  and  allows  the  user  to  grasp  them  readily. 
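
A shortest path tree of this kind can be built with Dijkstra's algorithm, as sketched below. The report does not state how distance is derived from relationship strength; the mapping `distance = -log(strength)` for strengths in (0, 1] is an assumption, as are the function name and the example values.

```python
import heapq
import math

def shortest_path_tree(strength, root):
    """Return {variable: parent} for the shortest path tree rooted at root.

    strength maps unordered pairs (u, v) to a value in (0, 1]; every
    pair of variables must be present (complete graph).
    """
    nodes = sorted({v for pair in strength for v in pair})

    def dist(u, v):
        s = strength.get((u, v), strength.get((v, u)))
        return -math.log(s)          # assumed strength-to-distance mapping

    best = {v: math.inf for v in nodes}
    best[root] = 0.0
    parent, done = {}, set()
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                 # stale queue entry
        done.add(u)
        for v in nodes:
            if v in done:
                continue
            nd = d + dist(u, v)
            if nd < best[v]:
                best[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent

strength = {("A", "B"): 0.9, ("B", "C"): 0.9, ("A", "C"): 0.5}
tree = shortest_path_tree(strength, "A")
```

In this example C attaches to B rather than directly to A, because the chain of two strong relationships is shorter than the single weak one. The resulting tree suggests that B, not A, is the natural predictor of C.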

Relationship  Aggregation  using  the  Shortest  Path  approach  is  a  heuristic,  visually  oriented 
method  of  helping  the  user  to  analyze  the  variable  structure  and  determine  a  reduced  model  to 
predict  a  given  variable.  It  is  able  to  analyze  the  output  of  either  of  the  two  methods  of 
Relationship  Discovery. 

Integrated  Relationship  Discovery  Tool 

An  Integrated  Relationship  Discovery  software  tool  has  been  built.  This  tool  contains  the  two 
forms  of  Relationship  Discovery,  Relationship  Aggregation  and  a  Kohonen-based  data 
compression  facility  necessary  to  make  the  Relationship  Discovery  algorithms  run  efficiently. 


Necessary  Enhancements  Identified  in  Phase  I 

In the course of obtaining these results and testing these concepts on data sets,
several necessary enhancements to the overall Data Base Mining system concept were identified.
These are:

•  Missing  Value  Prediction:  This  capability  will  provide  a  way  of  systematically  replacing 
missing  values  in  the  data  set  with  maximum  likelihood  estimates  of  their  true  values. 

•  Bad Data Detection: This capability will help identify potentially “bad”
records in the data set.

•  Data  Redundancy  Removal:  This  capability  is  necessary  to  optimize  the  model  explanation 
capability  and  eliminate  ambiguous  results. 

The first two of these, Missing Value Prediction and Bad Data Detection, together comprise
Automatic Data Cleaning.

These  enhancements  identified  during  the  Phase  I  effort  are  being  proposed  for  implementation 
during  Phase  II. 


Summary  of  Results  Obtained  in  Phase  I 

In  summary,  the  following  results  were  achieved  during  Phase  I  of  this  SBIR  project: 

•  Implementation  and  demonstration  of  an  automatic  approach  to  relationship  discovery  via 
2-D  subspace  projections. 

•  Development  of  a  Chi-squared  test  to  find  relationships. 

•  Implementation  of  the  capability  for  a  scatter-plot  display  for  quick  visualization  of 
relationships. 

•  Development of the capability to perform efficient Parzen windowing on the approximated
PDF  to  generate  3-D  surface  plots  of  the  PDF  projections. 

•  Development  of  the  capability  to  display  relationships  in  rank  space. 

•  Development  of  an  approach  to  the  model  reduction  problem  via  the  shortest  path  method 
using  the  Chi-squared  test  and  Cramer's  coefficient  to  generate  the  Shortest  Path  Tree. 

•  Implementation  of  these  capabilities  in  software  and  verification  of  software  performance 
on  real  and  artificial  data. 

•  Enhancement  of  the  general  software  capabilities  of  the  Relationship  Discovery  tool: 

-  Increase  in  utilization  of  neurocomputer  processing  capabilities  to  reduce  processing 
time. 

-  Enhancement  of  usability  through  implementation  of  a  simple  graphical  user  interface. 

-  Integration  of  the  software  into  a  single  coalesced  environment. 

•  Conceptualization  of  approaches  to  three  necessary  enhancements  to  the  Data  Base  Mining 
Concept. 
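
As an illustration of the Chi-squared test and Cramer's coefficient named in the summary above, the following Python sketch scores one two-dimensional projection. It is a sketch under stated assumptions, not the Phase I implementation: the equal-frequency (rank-based) binning, the choice of five bins per axis, the noise level, and the function names `rank_bins` and `chi_squared_and_cramers_v` are all assumed for illustration.

```python
import math
import random

def rank_bins(values, k):
    """Assign each value to one of k equal-frequency bins using its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def chi_squared_and_cramers_v(x, y, k=5):
    """Chi-squared statistic and Cramer's coefficient for a k-by-k binning."""
    n = len(x)
    bx, by = rank_bins(x, k), rank_bins(y, k)
    # Contingency table of bin co-occurrences.
    table = [[0] * k for _ in range(k)]
    for i, j in zip(bx, by):
        table[i][j] += 1
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Compare observed counts with those expected under independence.
    chi2 = 0.0
    for i in range(k):
        for j in range(k):
            expected = row[i] * col[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    # Cramer's coefficient rescales chi-squared into the range [0, 1].
    v = math.sqrt(chi2 / (n * (k - 1)))
    return chi2, v

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(1000)]
quadratic = [xi * xi + rng.gauss(0, 1) / 10 for xi in x]  # nonlinear, low noise
unrelated = [rng.gauss(0, 1) for _ in range(1000)]

chi2_q, v_q = chi_squared_and_cramers_v(x, quadratic)
chi2_r, v_r = chi_squared_and_cramers_v(x, unrelated)
```

The quadratic pair should score well above the unrelated pair even though its linear correlation is close to zero, which is the property that distinguishes this test from correlation analysis.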
Appendix 1

Chi-squared  Relationship  Discovery  on  Test  Data  Set  1 


This appendix contains the output of Chi-squared Relationship Discovery run on Test Data Set 1.
The equations used to generate this test data set are:

X = Gaussian random
Linear = 2*X + noise/SN
Quadratic = X*X + noise/SN
Cosine = cos(X) + noise/SN
Random = Gaussian random

As mentioned in the text, the constant SN is similar to a signal-to-noise ratio to the extent that small
values of SN imply large amounts of noise being added to the function values.

Five  different  values  of  SN  were  used.  The  value  of  SN  for  each  run  is  printed  on  the  output 
page. 

The  number  of  data  points  used  was  1000. 
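
For readers who want to reproduce a data set of this form, the equations above can be sketched in Python as follows. The appendix does not specify the noise distribution, the Gaussian parameters, or the five SN values, so the standard-normal noise, the zero-mean unit-variance Gaussians, the function name `make_test_data_set_1`, and the seed are assumptions.

```python
import math
import random

def make_test_data_set_1(n=1000, sn=1.0, seed=0):
    """Generate n records following the Test Data Set 1 equations."""
    rng = random.Random(seed)

    def noise():
        # Assumed: independent standard-normal noise for each term.
        return rng.gauss(0.0, 1.0)

    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        rows.append({
            "X": x,
            "Linear": 2 * x + noise() / sn,
            "Quadratic": x * x + noise() / sn,
            "Cosine": math.cos(x) + noise() / sn,
            "Random": rng.gauss(0.0, 1.0),
        })
    return rows

data = make_test_data_set_1(n=1000, sn=2.0)
```

Small values of `sn` inflate the noise term, so the functional relationships become progressively harder to detect.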
Appendix 2

Chi-squared  Relationship  Discovery  on  Test  Data  Set  2 


This  appendix  contains  the  output  of  Chi-squared  Relationship  Discovery  run  on  Test  Data  Set  2. 
The  equations  used  to  generate  this  test  data  set  are: 

X1 = Gaussian random
X2  =  Gaussian  random 
Linear  =  X1+X2  +  noise/SN 
Quadratic  =  X1*X1  +  X2  +  noise/SN 
Random  =  Gaussian  random 

As  mentioned  in  the  text,  the  constant  SN  is  similar  to  a  signal-to-noise  ratio  to  the  extent  that  small 
values  of  SN  imply  large  amounts  of  noise  being  added  to  the  function  values. 

Five  different  values  of  SN  were  used.  The  value  of  SN  for  each  run  is  printed  on  the  output 
page. 

The  number  of  data  points  used  was  1000. 
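
A corresponding Python sketch for this second data set is given below. As before, the standard-normal noise, the Gaussian parameters, the function name `make_test_data_set_2`, and the seed are assumptions that the appendix does not state.

```python
import random

def make_test_data_set_2(n=1000, sn=1.0, seed=0):
    """Generate n records following the Test Data Set 2 equations."""
    rng = random.Random(seed)

    def noise():
        # Assumed: independent standard-normal noise for each term.
        return rng.gauss(0.0, 1.0)

    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        rows.append({
            "X1": x1,
            "X2": x2,
            "Linear": x1 + x2 + noise() / sn,
            "Quadratic": x1 * x1 + x2 + noise() / sn,
            "Random": rng.gauss(0.0, 1.0),
        })
    return rows

data2 = make_test_data_set_2(n=1000, sn=2.0)
```

Here each dependent variable is a function of two inputs, so any single two-dimensional projection shows only part of the relationship even when no noise is added.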
Appendix 3

Probing  of  K-space  on  Test  Data  Set  1 


This  appendix  contains  the  output  of  Probing  of  K-space  run  on  Test  Data  Set  1.  The  equations 
used  to  generate  this  test  set  are: 

X  =  Gaussian  random 
Linear  =  2*X+noise/SN 
Quadratic  =  X*X  +  noise/SN 
Cosine  =  cos(X)  +  noise/SN 
Random  =  Gaussian  random 

As  mentioned  in  the  text,  the  constant  SN  is  similar  to  a  signal-to-noise  ratio  to  the  extent  that  small 
values  of  SN  imply  large  amounts  of  noise  being  added  to  the  function  values. 

Five  different  values  of  SN  were  used.  The  value  of  SN  for  each  run  is  printed  on  the  output 
page. 

The  number  of  data  points  used  was  1000. 
Appendix 4

Probing  of  K-space  on  Test  Data  Set  2 


This  appendix  contains  the  output  of  Probing  of  K-space  run  on  Test  Data  Set  2.  The  equations 
used  to  generate  this  test  data  set  are: 

X1 = Gaussian random
X2 = Gaussian random
Linear = X1+X2 + noise/SN
Quadratic = X1*X1 + X2 + noise/SN
Random = Gaussian random

As mentioned in the text, the constant SN is similar to a signal-to-noise ratio to the extent that small
values  of  SN  imply  large  amounts  of  noise  being  added  to  the  function  values. 

Five  different  values  of  SN  were  used.  The  value  of  SN  for  each  run  is  printed  on  the  output 
page. 

The  number  of  data  points  used  was  1000. 
Appendix 5

Chi-squared  Relationship  Discovery  on  Civilian  Customer  Data 


This  appendix  contains  the  output  of  Chi-squared  Relationship  Discovery  run  on  a  data  set 
obtained  from  a  civilian  customer  of  HNC.  This  data  set  contained  data  on  cellular  phone  usage 
and  billing. 

Also  included  are  surface  and  contour  plots  of  two  relationships  that  were  of  interest  to  the 
customer. 
11.44  RELATIONSHIP DETECTED
01ACSCHG -- 01SEPLAN 8: 11.44  RELATIONSHIP DETECTED
01ACSCHG -- 01RCFCHG: 11.44  RELATIONSHIP DETECTED
OTAMTYPE 2: 11.35  RELATIONSHIP DETECTED
11.04  RELATIONSHIP DETECTED
01AMAR30 -- 01AMCODO: 11.01  RELATIONSHIP DETECTED
01LCLLND -- 01MINPEK: 10.98  RELATIONSHIP DETECTED
01ACSCHG --
OTACDTCL 1:
10.88
01RCFCHG:
OTACDTCL 13:
10.44
01SEPLAN 14 -- OTNOMDAY: 10.38
01AMCODO -- 01FTRTOT: 10.11  RELATIONSHIP DETECTED
01ACSCHG -- 01ANPHON: 10.08  RELATIONSHIP DETECTED
01SEPLAN 11 -- 01SEPLAN 14: 10.07  RELATIONSHIP DETECTED
01SEPLAN 25 -- OTACDTCL 6: 10.03
01SEPLAN 10 -- OTACDTCL 13: 9.92  RELATIONSHIP DETECTED
01INTRST -- 01SEPLAN 2: 9.81  RELATIONSHIP DETECTED
01AMTODO -- 01 ROMS Kg: 9.77  RELATIONSHIP DETECTED
01SEPLAN 25 -- OTACDTCL 1: 9.76  RELATIONSHIP DETECTED
01FTS41 -- OTACDTCL 8: 9.70
01FTR41 --
01ACSCHG --
9.55
01AMCODO -- OTNOMDAY: 9.38
01ROMAIR: 9.31
01SEPLAN 18 -- OTACDTLM: 9.28  RELATIONSHIP DETECTED
01ACSCHG -- 01SEPLAN 25: 9.15  RELATIONSHIP DETECTED
01ACSCHG -- OTACDTCL 8: 9.13  RELATIONSHIP DETECTED
01FTR85 -- 01SEPLAN 17: 9.08  RELATIONSHIP DETECTED
01AMAR30 -- 01FTR11: 9.07  RELATIONSHIP DETECTED
01FTR24 -- OTACDTCL 2: 9.05  RELATIONSHIP DETECTED
01FTR41 -- 01SEPLAN 2: 9.02  RELATIONSHIP DETECTED
01SEPLAN 18 -- 01SEPLAN ?4: 9.00  RELATIONSHIP DETECTED
01FTR99 -- 01SEPLAN 17: 8.96  RELATIONSHIP DETECTED
01SEPLAN 0 -- 01SEPLAN 18: 8.94  RELATIONSHIP DETECTED
01SEPLAN 18 -- OTACDTCL 11: 8.91  RELATIONSHIP DETECTED
01SEPLAN 2 -- 01SEPLAN 23: 8.91  RELATIONSHIP DETECTED
01FTR19 -- 01FTR41: 8.90  RELATIONSHIP DETECTED
01MINPEK -- 01SEPLAN 18: 8.86  RELATIONSHIP DETECTED
01SEPLAN 6 -- 01SEPLAN 18: 8.80  RELATIONSHIP DETECTED
01ACSCHG -- 01SEPLAN 9: 8.78  RELATIONSHIP DETECTED
01LCLLND -- 01ROMTOL: 8.72  RELATIONSHIP DETECTED
01FTR04 -- 01MINPEK: 8.68  RELATIONSHIP DETECTED
01SEPLAN 10 -- 01SEPLAN 11: 8.66  RELATIONSHIP DETECTED
01FTR03 -- 01FTRTOT: 8.61  RELATIONSHIP DETECTED
01SEPLAN 10 -- OTACDTCL 1: 8.51
01FTR19 -- 01SEPLAN 15: 8.51  RELATIONSHIP DETECTED
01ACSCHG -- 01ROMTOL: 8.45  RELATIONSHIP DETECTED
01ACSCHG -- OTACDTCL 4: 8.38  RELATIONSHIP DETECTED
01SEPLAN 18 -- 01SEPLAN 19: 8.36  RELATIONSHIP DETECTED
01ACSCHG -- 01SEPLAN 12: 8.31  RELATIONSHIP DETECTED
01AMCODO -- 01SEPLAN 11: 8.28  RELATIONSHIP DETECTED
01RCFCHG -- 01SEPLAN 2: 8.27  RELATIONSHIP DETECTED
01AMCODO -- 01SEPLAN 10: 8.14  RELATIONSHIP DETECTED
01FTR24 -- OTACDTCL 8: 8.05  RELATIONSHIP DETECTED
01MINPEK -- OTNOMDAY: 7.91  RELATIONSHIP DETECTED
01ACSCHG -- 01SEPLAN 21: 7.89  RELATIONSHIP DETECTED
01AMTODO -- 01SEPLAN 10: 7.88  RELATIONSHIP DETECTED
01AMTODO -- 01FTR41: 7.84  RELATIONSHIP DETECTED
01FTR99 -- 01SEPLAN 22: 7.78  RELATIONSHIP DETECTED
01FTR41 -- 01SEPLAN 23: 7.72  RELATIONSHIP DETECTED
01ACSCHG -- 01FTR41: 7.71  RELATIONSHIP DETECTED
01FTR85 -- 01FTRTOT: 7.64  RELATIONSHIP DETECTED
01RCFCHG -- OTNOMDAY: 7.60  RELATIONSHIP DETECTED
01ACSCHG -- 01FTRTOT: 7.58  RELATIONSHIP DETECTED
01MINPEK -- OTACDTLM: 7.57  RELATIONSHIP DETECTED
01FTR99 -- 01SEPLAN 1: 7.49  RELATIONSHIP DETECTED
01ACSCHG -- 01SEPLAN 16: 7.47  RELATIONSHIP DETECTED
01SEPLAN 5 -- OTACDTLM: 7.30  RELATIONSHIP DETECTED

01AMCODO -- 01LCLLND: 7.23
01FTR11 -- 01MINPEK: 7.21
01ANPHON -- 01SEPLAN 25: 7.18
01FTR11 -- OTACDTCL 8: 7.05
01SEPLAN 13 -- OTACDTCL 2: 7.02
01ANPHON -- 01SEPLAN 2: 6.99
01SEPLAN 17 -- 01SEPLAN 18: 6.98
01SEPLAN 3 -- OTNOMDAY: 6.93
OTACDTCL 2 -- OTACDTCL 12: 6.91
01SEPLAN 5 -- 01SEPLAN 14: 6.90
01ACSCHG -- 01AMAR30: 6.85
01AMTODO -- OTNOMDAY: 6.84
01FTRTOT -- 01SEPLAN 5: 6.83
01SEPLAN 7 -- OTNOMDAY: 6.81
01MINPEK -- 01SEPLAN 7: 6.79
01FTR99 -- 01SEPLAN 5: 6.76
OTACDTCL 2 -- OTACDTCL 3: 6.70
01SEPLAN 18 -- OTACDTCL 1: 6.66
oiamiCdo -- 01SEPLAN ?5: 6.64
01SEPLAN 24 -- OTACDTCL 1: 6.61
01SEPLAN 13 -- OTACDTCL 8: 6.60
01SEPLAN 11 -- 01SEPLAN ?4: 6.56
01SEPLAN 0 -- OTACDTCL 1: 6.56
OTACDTCL 1 -- OTACDTCL 11: 6.55
01FTRTOT -- 01SEPLAN 25: 6.54
01SEPLAN 0 -- 01SEPLAN 11: 6.51
01SEPLAN 11 -- OTACDTCL 11: 6.50
01INTRST -- 01RCFCHG: 6.50
01FTR10 -- 01SEPLAN 24: 6.49
OTACDTCL 2 -- OTACDTCL 14: 6.44
01FTR.Il -- OTACDTCL 8: 6.44
01SEPLAN 11 -- OTACDTCL 2: 6.42
01FTR11 -- 01SEPLAN ?5: 6.42
01SEPLAN 6 -- 01SEPLAN 11: 6.41
01FTRTOT -- 01SEPLAN S: 6.38
OTACDTCL 8 -- OTACDTCL 12: 6.36
01FTR11 --
6.34
01FTR41 --
01RCFCHG --
6.33
01AMAR30 --
6.33
01SEPLAN 7: 6.31
01SEPLAN 18 -- OTACDTCL 13: 6.30
01AMAE60 -- OTACDTCL 2: 6.26
01AMTODO -- 01SEPLAN 7:
01SEPLAN C -- OTACDTCL 1: 6.24
oiamarSo -- 01MINPEK: 6.23
01AMTODO -- 01SEPLAN 5: 6.19
OTACDTCL 3 -- OTACDTCL 8: 6.16
01AMAR30 -- 01RCFCHG: 6.16
01SEPLAN 11 -- 01SEPLAN 19: 6.10
OlAMCffDO -- 01FTR41: 6.10
01SEPLAN 15 -- OTACDTLM: 6.06
01SEPLAN 10 -- OTACDTLM: 6.06
01FTR43 --
oiamcGdo --
01AMTODO -- 01FTR04:
OTACDTCL 1 --
01SEPLAN 3 -- 01SEPLAN 11: 4.16  RELATIONSHIP DETECTED
01FTR11 -- OTACDTLM: 4.14  RELATIONSHIP DETECTED
01SEPLAN 20 -- OTNOMDAY: 4.13  RELATIONSHIP DETECTED
01SEPLAN 24 -- OTACDTCL 4: 4.12  RELATIONSHIP DETECTED
01SEPLAN 16 -- OTNOMDAY: 4.11  RELATIONSHIP DETECTED
OTACDTCL 4 -- OTACDTCL 11: 4.08  RELATIONSHIP DETECTED
oirrelo -- 01RCFCHG: 4.05  RELATIONSHIP DETECTED
01FTR19 -- 01FTR99: 4.05  RELATIONSHIP DETECTED
OTACDTCL 5 -- OTNOMDAY: 4.01  RELATIONSHIP DETECTED
OTACDTCL 12 -- OTACDTLM: 4.00  RELATIONSHIP DETECTED
01SEPLAN 11 -- 01SEPLAN 15: 3.98  RELATIONSHIP DETECTED
01SEPLAN 1 -- OTACDTCL 1: 3.96  RELATIONSHIP DETECTED
01SEPLAN 15 -- OTNOMDAY: 3.94  RELATIONSHIP DETECTED
01SEPLAN 0 -- OTAMTYPE 2: 3.94  RELATIONSHIP DETECTED
01SEPLAN 1 -- 01SEPLAN 11: 3.93  RELATIONSHIP DETECTED
01SEPLAN 7 -- OTACDTCL 8: 3.92  RELATIONSHIP DETECTED
01ROMAIR -- 01SEPLAN 18: 3.92  RELATIONSHIP DETECTED
01MINPEK -- 01SEPLAN 14: 3.92  RELATIONSHIP DETECTED
OTACDTCL 14 -- OTACDTLM: 3.91  RELATIONSHIP DETECTED
01SEPLAN 23 -- 01SEPLAN 24: 3.90  RELATIONSHIP DETECTED
01SEPLAN 24 -- OTACDTCL 13: 3.88  RELATIONSHIP DETECTED
01SEPLAN 0 -- 01SEPLAN 23: 3.87  RELATIONSHIP DETECTED
01SEPLAN ?3 -- OTACDTCL 11: 3.86  RELATIONSHIP DETECTED
OTACDTCL 11 -- OTACDTCL 13: 3.84  RELATIONSHIP DETECTED
01FTR24 -- 01INTRST: 3.82  RELATIONSHIP DETECTED
01SEPLAN 6 -- 01SEPLAN 23: 3.81  RELATIONSHIP DETECTED
01AMTODO -- 01FTR11: 3.81  RELATIONSHIP DETECTED
01FTR41 -- 01SEPLAN 18: 3.81  RELATIONSHIP DETECTED
01FTRTOT -- 01SEPLAN 3: 3.80  RELATIONSHIP DETECTED
01SEPLAN 0 -- OTAMTYPE 0: 3.79  RELATIONSHIP DETECTED
01SEPLAN 19 -- OTNOMDAY: 3.78  RELATIONSHIP DETECTED
OlAMtifoO -- OTACDTCL 2: 3.78  RELATIONSHIP DETECTED
OTACDTCL 8 -- OTAMTYPE 0: 3.76  RELATIONSHIP DETECTED
01ROMAIR -- 01SEPLAN 6: 3.76  RELATIONSHIP DETECTED
01ANPHON -- 01FTRTOT: 3.76  RELATIONSHIP DETECTED
01AMTODO -- OTACDTCL 13: 3.75  RELATIONSHIP DETECTED
01FTR41 -- 01FTR43: 3.74  RELATIONSHIP DETECTED
01FTR24 -- OTACDTCL 1: 3.73  RELATIONSHIP DETECTED
OTACDTCL 2 -- OTAMTYPE 2: 3.71  RELATIONSHIP DETECTED
01AMAR?0 -- 01ANPHON: 3.70  RELATIONSHIP DETECTED
OTACDTCL 3 -- OTNOMDAY: 3.69  RELATIONSHIP DETECTED
01ROMTOL -- 01SEPLAN 18: 3.66  RELATIONSHIP DETECTED
01SEPLAN 14 -- OTACDTCL 8: 3.65  RELATIONSHIP DETECTED
01SEPLAN 0 -- OTACDTCL 13: 3.65  RELATIONSHIP DETECTED
01MINPEK -- 01SEPLAN 9: 3.65  RELATIONSHIP DETECTED
01SEPLAN 10 -- OTACDTCL 5: 3.64  RELATIONSHIP DETECTED
01FTR04 -- 01FTR11: 3.64  RELATIONSHIP DETECTED
01SEPLAN 19 -- 01SEPLAN 23: 3.63  RELATIONSHIP DETECTED
01SEPLAN 15 -- OTACDTCL 2: 3.62  RELATIONSHIP DETECTED
oiana5«o -- 01AMARCO: 3.61  RELATIONSHIP DETECTED
01FTR12 -- 01RCFCHG: 3.61  RELATIONSHIP DETECTED
OTACDTCL 5 -- OTACDTLM: 3.58  RELATIONSHIP DETECTED
01SEPLAN 3 -- OTACDTCL 1: 3.56  RELATIONSHIP DETECTED
01SEPLAN 5 -- 01SEPLAN 17: 3.55  RELATIONSHIP DETECTED
01SEPLAN 4 -- OTACDTCL 1: 3.54  RELATIONSHIP DETECTED
oiamcu5o -- OTACDTCL 4: 3.53  RELATIONSHIP DETECTED
01SEPLAN 11 -- 01SEPLAN 13: 3.52  RELATIONSHIP DETECTED
01SEPLAN 4 -- 01SEPLAN 11: 3.52  RELATIONSHIP DETECTED
oiamcuBO -- 01SEPLAN 8: 3.52  RELATIONSHIP DETECTED
01SEPLAN 11 -- 01SEPLAN 20: 3.48  RELATIONSHIP DETECTED
01FTRTOT -- OTACDTCL 2: 3.45  RELATIONSHIP DETECTED
01ACSCHG -- oirnu9: 3.45  RELATIONSHIP DETECTED
01SEPLAN 2 -- 01SEPLAN 24: 3.44  RELATIONSHIP DETECTED
01INTRST -- OTACDTCL 1: 3.44  RELATIONSHIP DETECTED
01AMARCO -- 01SEPLAN 2: 3.44  RELATIONSHIP DETECTED
01SEPLAN 0 --
Contour Plot of Probability Density Function